SUMMARY


This final report describes the ASSERT project “Detection and Classification of Synthetic
Aperture Radar Targets” associated with the URI Automatic Target Recognition (ATR)
project sponsored by DARPA. The main goal of this ASSERT project, together with the
URI-ATR project, is to develop detection and classification algorithms for automatic target
recognition. For the ASSERT project, we have focused on the use of a Bayesian
probabilistic reasoning approach to fuse multiple target feature data for the purpose of
target classification. We also developed Bayesian network learning algorithms to
automatically construct the Bayesian network model.

Two graduate students and one undergraduate student participated in the technical work
of this project. Two of them have received M.S. degrees and one is continuing toward his
Ph.D. degree. This project directly or indirectly supported the publication of eight
technical papers, two Master's theses, one Ph.D. thesis, and one technical report.


The  view,  opinions  and/or  findings  contained  in  this  report  are  those  of  the  author(s)  and 
should  not  be  construed  as  an  official  Department  of  the  Army  position,  policy,  or 
decision,  unless  so  designated  by  other  documentation. 


1.  Introduction 


To succeed on the battlefield, it is very important to have an accurate picture of the
current tactical situation. Modern sensor technology such as SAR has vastly
improved  the  quantity  and  quality  of  the  raw  information  on  which  tactical  intelligence  is 
based.  SAR  signatures  contain  coherent  noise  and  have  many  unknown  parameters  such 
as  amplitude,  target  spatial  position  and  target  orientation.  The  SAR  clutter  statistics  are 
at  least  somewhat  unknown  and  the  signal  is  nonstationary.  Our  research  focus  is  to  build 
on  our  development  of  new  detection  and  classification  algorithms  for  SAR  target  data. 
The  major  goal  of  this  research  was  to  develop  Bayesian  Network  algorithms  appropriate 
for  SAR  ATD/R  and  to  demonstrate  the  performance  of  these  algorithms. 

Problem  -  A  high  performance  SAR  target  detection  and  classification  system  will  make 
use  of  multiple  features  and  algorithms.  The  system  requires  an  approach  for  combining 
feature  and  algorithm  output  information  to  make  intermediate  and  final  decisions  about 
the  presence  and  identities  of  targets. 

Goal - Develop Bayesian Network and other algorithms appropriate for the SAR ATR
problem that can handle non-Gaussian features optimally. Demonstrate these algorithms
on the SAR ATR problem.

Approach  &  Objectives  -  We  have  worked  on  the  problem  of  target  discrimination  and 
classification  using  a  Bayesian  Network  whose  input  is  a  collection  of  target  features. 
These target features include the features of the Lincoln Laboratory discriminator. We
have also examined other applicable approaches such as multi-polarization fusion as well
as multi-resolution fusion using wavelets.

2.  Technical  Approach 

The  objectives  of  the  current  research  in  ATR  are  to  determine  techniques  for 
understanding  the  nature  and  special  features  of  a  SAR  image  and  use  those  to  develop 
specific  identification  techniques  for  classification.  Particularly,  we  are  interested  in  using 
the  Bayesian  network  technology  to  improve  ATR  performance. 




During  the  past  few  years,  Bayesian  networks  have  received  much  attention  as  an  efficient 
way  of  reasoning  under  uncertainty.  Such  networks  provide  a  probabilistic  model  of  the 
problem by means of graphs. The nodes correspond to the variables of interest; the states
of a node can be either discrete-valued or continuous-valued. The arcs, usually given in
terms  of  conditional  probabilities,  represent  the  probabilistic  relationship  between  nodes. 
The  networks  as  a  whole  can  be  used  to  represent  various  complicated  models  such  as 
target  and  sensor  models. 

Once  a  Bayesian  network  has  been  used  to  represent  the  model  of  a  problem,  the  inference 
problem is to determine the a posteriori probability distribution of the state given the
observed  evidence.  Many  techniques  have  been  developed  for  performing  Bayesian 
network  inference.  These  include  methods  based  on  graph-theoretic  implementations  of 
Bayes  rule  and  marginalization,  methods  based  on  message  passing  and  clustering,  and 
other  approximate  methods  based  on  simulation. 

2.1  Bayesian  Networks  Representation  and  Algorithms 

The  Bayesian  network  technology  is  a  set  of  modeling  (representation)  techniques  for 
encoding  large-scale  systems  with  inter-dependent  uncertain  elements  into  well-structured 
probability spaces, coupled with a set of inference techniques to obtain a posteriori
probabilistic assessments given available data. As a new technology, the Bayesian
network  has  been  shown  to  be  both  computationally  more  tractable  and  more  easily 
understood  than  its  predecessor  technology.  These  advantages  are  achieved  primarily 
through  one  technical  innovation,  the  representation  of  conditional  independence 
relationships.  Such  relationships  limit  the  information  used  in  an  assessment  or  decision  to 
that  which  is  directly  relevant  and  therefore  improve  both  the  efficiency  of  the 
representation  and  the  inference  process. 

Bayesian  networks  provide  a  flexible  representation  for  complex  models,  and  efficient 
inference algorithms. A Bayesian network is a directed acyclic graph in which each node
in the graph is a random variable. The nodes in the graph collectively satisfy a certain
Markov property, i.e., the predecessors of a node "separate" it from the nodes preceding
those predecessors. In a Bayesian network, the existence of an arc between two nodes
indicates a potential stochastic dependence between the two random variables represented
by the two nodes.

In  addition  to  the  convenient  and  flexible  representation,  a  major  benefit  of  using  Bayesian 
networks  is  the  existence  of  many  powerful  probabilistic  inference  algorithms  developed  in 
the  past  few  years.  In  ATR,  the  goal  of  inference  is  to  update  beliefs  in  particular  target 
classification  in  the  light  of  the  current  state  of  information  and  new  evidence  about  a 
target.  The  updated  beliefs  are  known  as  posteriors,  while  the  state  of  information  before 
the  evidence  is  known  as  priors.  Since  the  nodes  of  a  network  are  its  most  basic  unit,  it  is 
often  desired  to  know  the  posterior  distribution  for  each  node  in  the  graph.  Many 
algorithms  are  designed  to  handle  this  particular  query.  They  include  the  distributed 
algorithm,  the  influence  diagram  algorithm,  the  evidence  potential  algorithm,  and  the 
symbolic  probabilistic  inference  (SPI)  algorithm. 
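
As a minimal sketch of the posterior update described above, the following Python fragment computes the posterior of a target-class node given two observed feature nodes by direct enumeration (Bayes rule followed by normalization). The network shape (one class node with two conditionally independent feature children), the node names, and every probability value are hypothetical illustrations. They are not taken from this project, and enumeration is not one of the inference algorithms listed above.

```python
# Hypothetical three-node network: Class -> Feature1, Class -> Feature2.
# All numbers below are made-up illustrative values.

prior = {"target": 0.3, "clutter": 0.7}

# Conditional probability tables P(feature = value | class).
cpt_f1 = {"target": {"high": 0.8, "low": 0.2},
          "clutter": {"high": 0.3, "low": 0.7}}
cpt_f2 = {"target": {"high": 0.6, "low": 0.4},
          "clutter": {"high": 0.1, "low": 0.9}}


def posterior(f1, f2):
    """Return P(class | f1, f2) by enumerating the joint and normalizing."""
    joint = {c: prior[c] * cpt_f1[c][f1] * cpt_f2[c][f2] for c in prior}
    evidence = sum(joint.values())  # marginal probability of the evidence
    return {c: joint[c] / evidence for c in joint}
```

Calling `posterior("high", "high")` returns the updated belief in each class. Enumeration is only practical for very small networks, which is why the algorithms named above exploit the graph structure instead.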

2.2 Learning Bayesian Networks from Data

For target recognition, a Bayesian network can be constructed either from expert knowledge
or learned from the training database. Bayesian methods for learning networks from data
take prior knowledge and combine it with data to produce one or more Bayesian
networks. Bayesian learning methods are, in principle, based on a so-called score
function which is proportional to the posterior probability p(Bs|D) of a
network structure Bs given database D. Generally, the score function is derived according
to a number of assumptions on the underlying probabilistic model. A Bayesian structure
Bs that maximizes the score function is considered the most probable structure
generating the database.
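
To make the idea of a score function concrete, here is a minimal sketch that evaluates one well-known score, the Cooper-Herskovits (K2) metric, for a candidate structure over discrete variables. This report does not state which score function was used, so the choice of metric is an assumption. The function name `log_k2_score` and its argument layout are also hypothetical.

```python
import math
from collections import Counter


def log_k2_score(data, parents, arity):
    """Log Cooper-Herskovits (K2) score of a structure, up to the structure prior.

    data    : list of dicts mapping variable name -> integer state
    parents : dict mapping variable name -> tuple of parent names
    arity   : dict mapping variable name -> number of states
    """
    total = 0.0
    for var, pa in parents.items():
        r = arity[var]
        n_ijk = Counter()  # counts per (parent configuration, state)
        n_ij = Counter()   # counts per parent configuration
        for row in data:
            config = tuple(row[p] for p in pa)
            n_ijk[(config, row[var])] += 1
            n_ij[config] += 1
        for n in n_ij.values():
            # log[(r - 1)! / (n + r - 1)!]; unseen configurations contribute 0
            total += math.lgamma(r) - math.lgamma(n + r)
        for count in n_ijk.values():
            total += math.lgamma(count + 1)  # log(count!)
    return total
```

Comparing the score of two candidate structures on the same database (for example, with and without an arc between two variables) indicates which structure the data favor under this metric.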

During past years, several methods have been developed for learning Bayesian networks
from a given database. Some of these algorithms are inherently computationally
intensive, and others, which employ a greedy search heuristic, cannot guarantee
convergence to the right network even if the sample size is sufficiently large.
Our research focused partially on developing efficient methods for learning Bayesian
networks. A number of properties of Bayesian networks and learning metrics were
identified. Based on these properties, we developed new learning algorithms which can be
shown to be computationally efficient and to guarantee that the resulting network
converges to the right network given a sufficiently large sample size.

3.  Accomplishments  and  Issues 

(A)  Major  Accomplishments  of  the  Project 

• Converted Xpatch multi-polarization, multi-frequency synthetic SAR data for four
targets into image chips and wrote MATLAB code to display the images. A total of
3600 images for each target were created.

• Developed a multi-polarimetric fusion algorithm for SAR target classification [1,5].

• Conducted research on useful target features to use in our Bayes network algorithm for
target discrimination and classification [2].

• Developed and implemented Bayesian Network inference algorithms for generic networks
[7].

• Developed and tested a wavelet-based feature identification and fusion algorithm [3].

• Tested the Bayesnet algorithms against the real feature data provided by Lincoln Lab.
and obtained very good performance [2,6].

• Developed learning algorithms to construct Bayesnet from data automatically [4,8].


(B)  Project  Issues 


Because the parent project (DARPA, URI-ATR) was re-directed and extended (without
additional funding) to July 1997, we requested a no-cost extension in June 1996
for this accompanying ASSERT project. We received approval for a no-cost
extension of this project to Feb. 1997, and subsequently to July 31, 1997.

Abstract — The problem of target classification using Synthetic Aperture Radar (SAR) polarizations is
considered from a Bayesian decision point of view. This problem is analogous to the multi-sensor problem. We
investigate the optimum design of a data fusion structure given that each classifier makes a target classification
decision for each polarimetric channel. Though the optimal structure is difficult to implement without complete
statistical information, we show that significant performance gains can be made even without a perfect model.
First, we analyze the problem from an optimal classification point of view using a simple classification example
by outlining the relationship between classification and fusion. Then, we demonstrate the performance
improvement on real SAR data by fusing the decisions from a Gram-Schmidt image classifier for each
polarization.

Radar target classification; Automatic target recognition; Synthetic aperture radar; Feature fusion


1. INTRODUCTION

There has been a strong interest in employing
multiple sensors for surveillance detection and recognition
for a number of years. Some of the motivating
factors for this interest are: increased target illumination,
increased coverage, and increased information for
recognition. For military applications, combining several
sensors or multiple looks of the same sensor is an
effective method for increasing recognition performance
while providing robustness against environmental
effects and countermeasures. The approach we outline
offers advantages that can be extended to many
applications in multisensor surveillance. We have
chosen to demonstrate it on an application that is
particularly challenging and not fully explored, i.e. the
fusing of multi-polarimetric classification decisions
from a Synthetic Aperture Radar (SAR) sensor.

In this paper a two-level decision problem is outlined.
The first level is the single source classification solution.
The second level is where a fusing focal point receives
each decision, where each decision is made individually
and independently of the others. We analyze this from an
optimality criterion and assume that the sources transmit
their decisions instead of raw data. The optimal decision
fusion test is outlined and demonstrated with several
examples. Instead of using a binary decision alone in the
fusion process, we also utilize the value of the decision
statistic for maximum information usage. As a result, we
show that the fused classification performance can
significantly exceed single source performance. Examples
for the Gaussian case are provided and the implications
for more realistic non-Gaussian sensor data are
discussed.

The approach we developed offers many advantages
for multiple sensor surveillance applications. The main
advantage is in algorithm design and implementation,
where each sensor classifier can be designed and
optimized independently of the others. In fact, we
demonstrate our approach with a particularly challenging
fully polarimetric SAR example where the
polarimetric channels are inherently more correlated
than the sources from independent sensors would be.

This work is closely related to the distributed multiple
sensor detection problem which has been reported in
references (1-4). Tenney and Sandell developed a
theory for obtaining the distributed Bayesian detection
rules. They derived the decision rules for the individual
detectors that are coupled with information from the
other sensors. This decision process is very difficult to
perform when the relationship among sensor data is
unknown. Chair and Varshney presented an optimum
fusion structure given that the detectors were independently
designed. The solution was then used for a
Neyman-Pearson test. Surprisingly, this type of fusion
construction has not been applied to the problem of
multi-polarimetric channel SAR imagery. The outline of
this paper is as follows: first we cover fusion and single source
(single polarization, sensors, etc.) preliminaries; then the
fusion algorithm is demonstrated on the two-class
Gaussian classification problem; and finally the fusion
paradigm is demonstrated on fully polarimetric
SAR data in conjunction with a Gram-Schmidt image
feature selection method.

2.  PRELIMINARIES 


There are two major options for decision making with
multiple sources. The first option provides complete
sensor information to a centralized processor. This is
sometimes referred to as feature-level fusion because all
the feature information is transmitted to the fusion
center for a combined decision. The second option is to
have a decentralized, decision-level fusion. That is,
some or all of the signal processing is performed at each
individual source and only local decisions are used for
fusion.

The second option is more attractive for many
applications because in the first option the
likelihood functions require knowledge of the joint
feature distribution among sensors, $p(x \mid w_i, s_j)$, where
$w_i$ is the $i$th class, $s_j$ is the $j$th sensor and $x$ represents the
feature. This is difficult to obtain even for similar
sensors, because the detection thresholds at individual
detectors are usually coupled, i.e. they are not
independent. Also, the second option is more desirable
due to its cost, survivability, and bandwidth considerations.
Nevertheless, the trade-off of the decision-level
simplicity is a loss in optimality if the data is not or
cannot be uncoupled.

In this paper, we consider the problem of making a
decision between two hypotheses. A number of N
sensors receive observations and independently
implement a local* test. Let $u_j$ designate the decision
of the $j$th sensor, having taken into account all the
observations available to this sensor at the time of the
decision. Every sensor transmits its decision to the
fusion center, so that the fusion center has all N
decisions available for processing at the time of the
decision making.

Before proceeding with the fusion, we first consider
the single sensor problem. To minimize the average
probability of error†, we should select the class that
maximizes the a posteriori probability $P(w_i \mid x)$. In other
words, for the multi-category case, we minimize the
error rate:

Decide $w_i$ if $P(w_i \mid x) > P(w_j \mid x)$ for all $j \neq i$. (1)

Now, representing the classifier in terms of the
discriminant functions $g_i(x)$, any of the following
choices can be derived, giving identical classification
results, but some can be simpler to understand or to
compute than others:

$$g_i(x) = P(w_i \mid x) \qquad (2)$$

$$g_i(x) = \frac{p(x \mid w_i) P(w_i)}{\sum_j p(x \mid w_j) P(w_j)} \qquad (3)$$

$$g_i(x) = p(x \mid w_i) P(w_i) \qquad (4)$$

$$g_i(x) = \log p(x \mid w_i) + \log P(w_i) \qquad (5)$$

where $P(w_i)$ are the class prior probabilities.

*In our examples, local class parameters (feature distributions)
are estimated using training data.

†Theoretical optimality can only be achieved when all class
parameter distributions are known.


While the two-category case is just a special instance
of the multi-category case, its classifier can be written in terms of
a single discriminant function. The following form is particularly
convenient and will be used throughout the paper:

$$g(x) = \log \frac{p(x \mid w_1)}{p(x \mid w_2)} + \log \frac{P(w_1)}{P(w_2)}. \qquad (6)$$
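
The claim that the discriminant functions in equations (2) to (5) give identical classification results can be checked numerically. The sketch below is illustrative only: the three univariate Gaussian class models, their priors, and the function names are hypothetical and are not taken from the paper.

```python
import math


def gaussian_pdf(x, mean, std):
    """Univariate normal density."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


# Hypothetical class models: (mean, std, prior).
CLASSES = [(0.0, 1.0, 0.5), (1.0, 1.0, 0.3), (3.0, 2.0, 0.2)]


def g_joint(x):
    """Equation (4): likelihood times prior."""
    return [gaussian_pdf(x, m, s) * p for m, s, p in CLASSES]


def g_posterior(x):
    """Equations (2)/(3): posterior obtained by normalizing equation (4)."""
    joint = g_joint(x)
    total = sum(joint)
    return [j / total for j in joint]


def g_log(x):
    """Equation (5): log-likelihood plus log-prior."""
    return [math.log(gaussian_pdf(x, m, s)) + math.log(p) for m, s, p in CLASSES]


def decide(scores):
    """Index of the largest discriminant value, i.e. the rule of equation (1)."""
    return max(range(len(scores)), key=scores.__getitem__)
```

For any feature value `x`, `decide(g_posterior(x))`, `decide(g_joint(x))` and `decide(g_log(x))` select the same class, because normalization and the logarithm are both monotonic and do not change which score is largest.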


3. DATA FUSION


The data fusion problem can be viewed as an m-hypothesis
problem with individual source decisions
being the observations. The first fusion rule we will
discuss was given by Chair and Varshney, who showed
that the Bayesian optimum rule is a weighted sum of local
decisions, the weights being functions of the local false
alarm rates and probabilities of detection. Each detector's
decision is assumed to be statistically independent and each
detector makes a binary decision $u_i$, where $i = 1, \ldots, n$:

$$u_i = \begin{cases} -1, & \text{if } H_0 \text{ is declared} \\ +1, & \text{if } H_1 \text{ is declared.} \end{cases}$$

After processing the observations locally, the decisions
$u_i$ are transmitted to the fusion processor. The structure
of this fusion rule is derived using the discriminant
function discussed earlier [equation (2)],

$$g(\mathbf{u}) = \log \frac{P(H_1 \mid \mathbf{u})}{P(H_0 \mid \mathbf{u})} = \log \frac{P_1}{P_0} + \sum_{S_+} \log \frac{P_{D_i}}{P_{F_i}} + \sum_{S_-} \log \frac{1 - P_{D_i}}{1 - P_{F_i}} \qquad (7)$$

where $S_+$ is the set of all $i$ such that $u_i = +1$, $S_-$ is the
set of all $i$ such that $u_i = -1$, $P_{F_i}$ and $P_{D_i}$ are the false
alarm rates and probabilities of detection of each local
sensor, and $P_1$ and $P_0$ are the a priori probabilities of the
two hypotheses. In the case where all the sensors are
similar and operate at the same error probability level,
i.e. $P_{F_i} = P_F$ and $P_{D_i} = P_D$ for every sensor, equation (6)
is particularly easy to analyze because the
decision variable has a binomial distribution:
k out of N decisions favor a
hypothesis. Hence, the fusion center false alarm and
detection probabilities both take the form

$$\sum_{i = \lceil T \rceil}^{N} \binom{N}{i} p^i (1 - p)^{N - i}, \qquad (8)$$

evaluated with $p = P_F$ and $p = P_D$ respectively,
where $\lceil T \rceil$ indicates the smallest integer exceeding $T$,
the decision threshold of $k$.
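
The binary fusion rule and the k-out-of-N analysis above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, local decisions are coded as -1 for H0 and +1 for H1, and the sensors are assumed statistically independent as in the text.

```python
import math


def chair_varshney(u, pd, pf, p1=0.5):
    """Fusion statistic of equation (7) for local decisions u_i in {-1, +1}.

    u  : list of local decisions
    pd : list of local detection probabilities
    pf : list of local false alarm probabilities
    p1 : prior probability of H1 (assumed value; P0 = 1 - p1)
    """
    g = math.log(p1 / (1.0 - p1))
    for ui, d, f in zip(u, pd, pf):
        if ui == +1:
            g += math.log(d / f)
        else:
            g += math.log((1.0 - d) / (1.0 - f))
    return g  # decide H1 when g > 0


def k_out_of_n(p, n, k):
    """P(at least k of n independent sensors declare H1), each with probability p."""
    return sum(math.comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(k, n + 1))
```

With identical sensors and equal priors, the sign of `chair_varshney` follows the majority of the local decisions, and `k_out_of_n(P_F, N, k)` and `k_out_of_n(P_D, N, k)` give the fused false alarm and detection probabilities for a vote threshold `k`.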
[Figure: ROC comparison]


Thomopoulos et al. obtained similar results using
the N-P criterion, in addition to showing improved
performance by transmitting a single confidence bit
along with the hard decision. We follow basically the
latter approach, but do not restrict the fusion rule to a
single confidence bit. Lee and Chao observed that by
using a 3-bit fusion paradigm the detection performance
of the system was nearly optimal (i.e. close to the performance
of an optimum centralized system). The binary decision
process certainly implies an information loss, but is used
for distributed sensor problems where the communication
channel is an issue. For our application we are not
hindered by the communication channel. Specifically,
we assume that the decision statistic as well as the
decision threshold of each local detector are given. We
investigate two fusion algorithms based on this assumption.
The first method is the MAP rule introduced in
equation (1) as a classifier, which applies equally well
as a fusion algorithm, i.e.:

Decide $w_i$ if $P(w_i \mid \mathbf{u}) > P(w_j \mid \mathbf{u})$ for all $j \neq i$. (9)

Naturally, equation (9) is the desired implementation
because it provides an optimal‡ solution. However, in
general, we may not be able to accurately estimate the
feature distribution functions. Therefore, the second
method is a heuristic approach that directly extends the
optimal binary approach introduced by Varshney
[equation (7)]. The fusion algorithm is the same but
the decision region is expanded to include the full
threshold range:

$$g(\mathbf{u}) = \log \frac{P_1}{P_0} + \sum_{A_+} \log \frac{p_{D_i}}{p_{F_i}} + \sum_{A_-} \log \frac{1 - p_{D_i}}{1 - p_{F_i}} \qquad (10)$$

where $A_+$ is the set of all $i$ such that $\{g_i(x) > T_i\}$ and
$A_-$ is the set of all $i$ such that $\{g_i(x) \le T_i\}$,
with $T_i$ being the individual source threshold for
partitioning the decision regions, and $p_{D_i}$ and $p_{F_i}$ are the
probabilities defined by the Cumulative Distribution
Functions (CDFs) for each decision statistic, e.g.
$p_{F_i} = P(u_i > g_i(x) \mid w_i)$. This is simply an extension
of equation (6) to include more than binary decision
information. In practice, the CDFs will be quantized and
estimated from training on the individual sensor's
classifier error probabilities, which is the same for
equation (9), but the advantage here is that if the estimates
are not ideal, then the added information from passing
the decision partitions can provide some robustness against
environmental uncertainties. In a distributed scenario
the weights can be computed at each sensor and
transmitted to the fusion center, where they will be
summed and compared to the decision threshold.

‡Again, optimality can only be achieved when all feature
distributions are known.

To demonstrate a simple example we consider a
system of three sensors, N = 3, where the observation of
each sensor is distributed normally as $\mathcal{N}(0,1)$ for $H_0$ and
$\mathcal{N}(1,1)$ for $H_1$. We compare the performance with the
best centralized scheme, which utilizes raw data, not
decisions, from the different sensors. Figure 1 displays
the single sensor performance, the optimal binary fusion
[equation (7)], and the MAP fusion.
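
For the three-sensor Gaussian example, one operating point of each scheme can be computed in closed form. The sketch below is not a reproduction of Figure 1: it evaluates a single point rather than a full ROC curve, and it assumes equal priors and a local threshold of 0.5 (the midpoint of the two means), neither of which is stated in the text.

```python
import math


def q(x):
    """Gaussian tail probability Q(x) = P(Z > x) for Z ~ N(0, 1)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def binom_tail(p, n, k):
    """P(at least k successes in n independent trials with success probability p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


N = 3
T_LOCAL = 0.5  # assumed local threshold: midpoint of the two means

# Single sensor: observation is N(0, 1) under H0 and N(1, 1) under H1.
pf_single = q(T_LOCAL)
pd_single = q(T_LOCAL - 1.0)
pe_single = 0.5 * (pf_single + (1.0 - pd_single))

# Binary fusion of identical sensors with equal priors is a 2-out-of-3 vote.
pf_vote = binom_tail(pf_single, N, 2)
pd_vote = binom_tail(pd_single, N, 2)
pe_vote = 0.5 * (pf_vote + (1.0 - pd_vote))

# Centralized test on raw data: the sum of the N observations is
# N(0, N) under H0 and N(N, N) under H1, thresholded at N / 2.
pe_central = q((N / 2.0) / math.sqrt(N))
```

Under these assumptions the average error probability falls from the single sensor to the binary vote, and falls further for the centralized test, which is the ordering the text describes between decision-level and raw-data fusion.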

Up to this point we have looked at the design of a
decision-level fusion scheme. The present analysis can
be extended in many directions. Equation (7) is certainly
not difficult to implement, since it only processes a
single hard bit decision. Let us examine the extra
difficulty in implementing equations (9) and (10) using
the full range of decision statistics from training. This is
not the typical difficulty associated with the classical
methods that attempt to assimilate the likelihood ratio
solution because the probability distributions for each
hypothesis must be defined. In the next section we apply
the decentralized scheme to fuse the returns of a fully
polarimetric SAR classifier. With three linear polarizations
which are known to have low correlation, the gain
is shown to be considerable. First, we introduce the
classification procedure that will be used for each
polarization and then we discuss the fusion methods that
operate on each classified decision.

3.1.  Description  of  the  classifier  and  fusion 
implementation 

A simplified block diagram of a complete multi-stage
ATR system is shown in Fig. 2. The first stage locates
regions of interest that contain “target-like” features.
The goal of this stage is to rapidly sift through the sensor
imagery to make quick and computationally efficient
decisions while still keeping the false alarm rate at a
reasonable level. Although the first stage is still an
active research problem, this paper focuses on the latter
two stages (target classification and fusion) where the
burden of the final decision making process lies. For the
classifier we use a signal subspace technique popularized
in communications to get target features, where
each signal is defined by a separate basis function in an
orthonormal set. If this set can be described, then the
optimum classifier, and hence the minimum
probability of error, easily follows. One method of determining a set of
orthonormal basis functions from a signal set is
the Gram-Schmidt orthonormalization procedure. This
is similar to several orthogonal methods such as the
Karhunen-Loeve expansion. The Gram-Schmidt basis
functions offer a reduced yet complete signal set. For
this preliminary study, we use a target set having an
angular coverage of approximately 68°. The target
images are 32x32 pixel chips of Lincoln Laboratory's
millimeter wave SAR sensor data in the spotlight
mode. The images are separated by one degree azimuth
increments; thus, we have 64 images of each target type.
We use a subset for designing the classifier. The
classifier is designed by processing a target set through
a Gram-Schmidt orthonormalization analysis to determine
a subset that best accounts for the characteristics
of all the images within the target space. Let the design
images be denoted by vectors $X_1, X_2, \ldots, X_N$. The
residuals of each image are combined into a matrix
formed from the vectors as

$$X = [X_1, X_2, \ldots, X_N]. \qquad (11)$$

The Gram-Schmidt decomposition process requires that
we choose a subset of images that provides the best
representation of the target set. We refer to this subset of
images as the Gram-Schmidt (GS) set. The GS
algorithm is described in many linear algebra textbooks.
In our case an orthonormal set Q is produced
by the algorithm from X to make up the components of
the newly reconstructed design set Y, such that:

$$Y_n = \sum_{i=1}^{n} Q_i, \quad n = 1, \ldots, N \qquad (12)$$

where N is the rank of Q.

A subset consisting of N images per target is selected
for the classifier design. For the 68 target chips, a design
chip was selected every other degree. Figure 3 shows a
transformation example for the third GS-element image,
the corresponding basis image (third element of a GS
set of eight), and $Y_3 = Q_1 + Q_2 + Q_3$. Visually, we can see
how the combined $Y_3$ is nearly a perfect reconstruction
of the original image $X_3$.
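
The orthonormalization and reconstruction steps can be illustrated with a short, self-contained sketch. It assumes each image chip has been flattened into a vector, uses tiny three-element vectors as hypothetical stand-ins for the 32x32 chips, and reconstructs a vector as the sum of its projections onto the basis, which is the textbook form and may differ in detail from the construction used in the paper. The function names are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def gram_schmidt(vectors, tol=1e-10):
    """Modified Gram-Schmidt: return an orthonormal basis for span(vectors)."""
    basis = []
    for v in vectors:
        residual = list(v)
        for q in basis:
            coeff = dot(residual, q)  # remove the component along q
            residual = [r - coeff * qi for r, qi in zip(residual, q)]
        norm = math.sqrt(dot(residual, residual))
        if norm > tol:  # skip vectors already inside the current span
            basis.append([r / norm for r in residual])
    return basis


def reconstruct(v, basis):
    """Sum of the projections of v onto each basis vector."""
    out = [0.0] * len(v)
    for q in basis:
        coeff = dot(v, q)
        out = [o + coeff * qi for o, qi in zip(out, q)]
    return out
```

A design vector that lies in the span of the basis is reconstructed exactly, while a vector outside the span leaves a nonzero residual; the size of that residual is what makes the projection usable as a classification feature.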

The classifier operates on the unknown data using the
GS set as follows. A test image $X_t$ is projected onto the
N-dimensional space spanned by the GS set. The
components of the projection are calculated as follows:

$$e_i = (X_t - \bar{X}) \cdot Q_i \qquad (13)$$

where $X_t - \bar{X}$ is the residual test image. This test will
determine if there is enough energy contained in the test
image to fall within one of the target classes. The scalar
components are placed into an N-dimensional feature
vector

$$e_T = [e_1, e_2, \ldots, e_N] \qquad (14)$$

The max($e_i$) for $i = 1, \ldots, N$ is the distance decision
measure corresponding to the best projection onto the
GS set. Hence, if the GS set completely characterizes
the target set then all of the target images should project
onto one of the N vectors completely. This
procedure was also tested against a “cultural” clutter chip
set in the same manner. Referring to Fig. 2, these
chips were passed through a stage 1 prescreener
developed by Lincoln Laboratory. They consist of
100 challenging false alarms. Fig. 4 shows the single
channel HH, HV, and VV performance of the classifier
operating against all 100 false alarm chips and
the 68 target chips. Larger GS sets were tested with
improved performance, but the goal was to retain a
reasonably small filter set. Also, other researchers have
reported algorithm performance on this data set, e.g.
an eigen-image approach, a quadratic distance correlation
classifier, and a shift invariant 2D pattern matcher.
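
The projection test described above can be sketched as follows. The sketch assumes that the "residual" test image means the test vector with a mean design vector subtracted, which is an interpretation rather than something the text states. The basis, the mean vector, and the function names are hypothetical.

```python
def project(test_vec, mean_vec, basis):
    """Components e_i of the mean-removed test vector on each basis vector."""
    residual = [t - m for t, m in zip(test_vec, mean_vec)]
    return [sum(r * q for r, q in zip(residual, b)) for b in basis]


def decision_statistic(test_vec, mean_vec, basis):
    """Largest projection component; compared against a threshold downstream."""
    return max(project(test_vec, mean_vec, basis))


# Hypothetical orthonormal basis and mean vector, for illustration only.
BASIS = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
MEAN = [0.0, 0.0, 0.0]
```

For example, `project([3.0, 4.0, 5.0], MEAN, BASIS)` gives the feature vector `[3.0, 4.0]`, and `decision_statistic` returns its largest entry. A clutter chip with little energy in the target subspace would produce a small statistic and fall below the decision threshold.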


Fig. 2. Block diagram of Automatic Target Recognition (ATR) system.

Fig. 3. GS orthogonal composition example.


Our algorithm reported very similar performance for a
single channel classifier.§


§All four algorithms reported similar performance within one
or two false alarms. There were not enough target chips to draw
a confident conclusion.


Nevertheless, by combining the classification decisions
using rules such as equations (9) and (10),
further refinement of classification performance can be
obtained. In our next experiment, we demonstrate this
by combining the three classifier decisions from each
polarization into the heuristic fusion rule. A previous
analysis determined the polarizations to be marginally
correlated, which indicated the possibility of successful
decision-level fusion. Each polarization channel was
classified by the GS classifier, first to determine a
minimum probability of error threshold and second to
estimate the CDFs by using a Gaussian approximation
from the classifier decisions, as required by the
weighting functions [equation (10)]. Figure 4 displays
the improved performance resulting from fusing the
polarization decisions, and Table 1 shows their minimum
average probability of errors¶. By choosing a
larger GS set the performance improved, but we desired
a less than perfect classifier to fully demonstrate the
fusion benefit.

Table 1. Minimum average probability of errors.

Channel   Bayes error
HH        6.28%
HV        4.00%
VV        2.50%
Fusion

¶Average of missed detection probability and false alarm
probability.

In  the  second  experiment,  instead  of  using  the 
Gaussian  approximation  for  the  distribution,  we  formed 
an  empirical  distribution  from  the  histogram.  In  this 
case  the  MAP  fusion  performed  significantly  better  as 
indicated  in  Fig.  5. 
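
The difference between the two experiments can be sketched in a few lines. The following is only an illustration, not the implementation used in the paper: it assumes a list of scalar decision statistics from one channel (the values below are made up), fits a Gaussian CDF to them, and builds an empirical CDF from the same samples. The function names are hypothetical.

```python
import math
from bisect import bisect_right


def gaussian_cdf(samples):
    """Return a CDF function from a Gaussian fitted to the samples."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / (n - 1)
    std = math.sqrt(var)
    return lambda x: 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def empirical_cdf(samples):
    """Return the empirical CDF: the fraction of samples <= x."""
    ordered = sorted(samples)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


# Hypothetical decision statistics (illustrative values only).
stats = [0.1, 0.2, 0.2, 0.3, 0.9, 1.5, 2.8]
g = gaussian_cdf(stats)
e = empirical_cdf(stats)
```

For skewed statistics such as these, the two CDFs can differ noticeably in the tails, which is one plausible reason the empirical form changes the fusion weights.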

4.  CONCLUSION 

We have studied a two-level decision system in which
the local decision statistics and their performance
characteristics are given. Instead of using a binary
decision alone in the fusion process, we utilize the value
of the decision statistic for maximum information usage.
As a result, we show by example that the fused classification
performance can significantly improve on single-source
performance. We also developed practical implementations
to improve the binary fusion strategy. The
approach we developed offers many advantages for
multiple sensor surveillance applications. The main
advantage lies in the algorithm design and implementation,
where each local classifier can be designed and
optimized independently of the others. We demonstrated
our approach with a particularly challenging fully
polarimetric SAR example where the polarimetric
channels are inherently more correlated than the sources
from independent sensors.

REFERENCES 

1. R. R. Tenney and N. R. Sandell, Jr., Detection with
distributed sensors, IEEE Trans. Aerospace and Electronic
Systems AES-17, 501-510 (1981).

2. Z. Chair and P. K. Varshney, Optimal data fusion in
multiple sensor detection systems, IEEE Trans. Aerospace
and Electronic Systems AES-22(1), 98-101 (1986).

3. S. C. Thomopoulos, R. Viswanathan and D. Bougoulias,
Optimal decision fusion in multiple sensor systems, IEEE
Trans. Aerospace and Electronic Systems AES-23(5),
644-652 (1987).

4. R. Duda and P. Hart, Pattern Classification and Scene
Analysis, p. 32, Wiley, New York (1973).

5. C. C. Lee and J. J. Chao, Optimum local decision space
partitioning for distributed detection, IEEE Trans. Aerospace
and Electronic Systems AES-25(4), 536-544 (1989).

6. J. Wozencraft and I. Jacobs, Principles of Communication
Engineering, pp. 266-272, Waveland, Prospect Heights, Ill.
(1965).

7. L. M. Novak and G. J. Owirka, Radar target identification
using an eigen-image approach, IEEE National Radar
Conference, Atlanta, GA, March (1994).

8. A. Hauter, A. Williams, G. Orsak, and V. Diehl,
Benchmarking ATR test data, ATR Systems and Technology
Proceedings, November (1994).


About  the  Author — ANDREW  HAUTER  received  the  M.S.  Degree  in  Electrical  and  Computer  Engineering 
from George Mason University (GMU) in 1995 and is currently a Ph.D. student in Computational Sciences and
Informatics (SCI) at GMU. His research interests include multiresolution approaches for modeling
multi-hypothesis target classes; decentralized fusion for combining multi-feature and multi-sensor classifier decisions;
and non-parametric classification strategies.


About  the  Author — KUO-CHU  CHANG  received  the  B.S.  degree  in  Communication  Engineering  from  the 
National  Chiao-Tung  University,  Taiwan,  in  1979  and  the  M.S.  and  Ph.D.  degrees  both  in  Electrical  Engineering 
from  the  University  of  Connecticut  in  1983  and  1986  respectively.  From  1983  to  1992,  he  was  a  senior  research 
scientist in the Advanced Decision Systems (ADS) division, Booz-Allen and Hamilton, Mountain View, California.
He joined the Systems Engineering department, George Mason University, in 1992 as an associate professor. His
research interests include estimation theory, optimization, signal processing, and data fusion. He is particularly
interested in applying unconventional techniques in conventional decision and control systems. He has
published more than fifty papers in the areas of multitarget probabilistic tracking, distributed sensor fusion, and
Bayesian probabilistic inference. He is currently an editor on navigation/tracking systems for IEEE Transactions
on Aerospace and Electronic Systems. Dr. Chang is also a member of Eta Kappa Nu and Tau Beta Pi.


About the Author — SHERMAN KARP received the BSEE and MSEE from MIT, in 1960 and 1962, and the Ph.D.
from the University of Southern California in 1967. From 1978 to 1981 he was a principal scientist in the Strategic
Technologies Office in DARPA where, among other things, he invented the blue/green laser satellite to submarine
optical communications program and was principal for the development of the mini-GPS system. Since 1986 he
has been serving as a consultant to both government and industry in GaAs systems, communication systems,
electro-optic systems, radar, geo-location systems, and detection and recognition ATR algorithm development.
He recently pioneered the development of the Soldier 911 geo-location reporting system.


Feature-based  target  recognition  with  a  Bayesian 
network 


Jun  Liu 

Kuo-Chu  Chang 

George  Mason  University 
School  of  Information  Technology  and 
Engineering 

Center of Excellence in Command, Control,
Communications, and Intelligence
(C3I)

Fairfax,  Virginia  22030 
E-mail:  jllu@c3i.gmu.edu 


Abstract. The problem of target classification with high-resolution, fully
polarimetric, synthetic aperture radar (SAR) imagery is considered. We
propose a framework that uses a Bayesian network for feature fusion to
deal with the difficult problem of SAR target classification. One difficult
problem in SAR feature identification and fusion for target classification
is that the features identified may not be independent and that it is not
easy to find the “right” fusion rule to combine them. The Bayesian
network model, when constructed properly, can explicitly represent the
conditional independence and dependence between various features and
therefore provide a sound and natural framework for feature fusion. This
paper summarizes our recent work in SAR target recognition using a
feature-based Bayesian inference approach. The approach works on
selected features, which are chosen so that the separability of the original
data is well maintained for later classification. Once the original data
are mapped into feature space, the probabilistic model between features
and the target is estimated and represented by a Bayesian network,
which is then used to calculate the probabilities that a target belongs to
one of the given classes based on the observed features. A comparison
between the above technique and traditional statistical approaches
such as nearest mean and Fisher pairwise is illustrated based upon
performance on a fully polarimetric ISAR (inverse SAR) image data set.
Note that although the feature set used in the paper is obtained from the
same sensor, the concepts of feature selection and Bayesian network
formulation discussed in the paper are not restricted to this case only.
They can be applied to multisensor feature-level fusion as well.
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1  Introduction 

The objective of an automatic target recognition (ATR) system
is to detect and recognize targets from sensor data. One
of the important components of an ATR system is its classifier.
The function of the classifier is to categorize input
measurements that represent detected targets according to
target type. The classifier output corresponding to each input
is an estimate of the correct category label, based on the
observable characteristics of the input.

In general, a feature-based classifier consists of two major
parts, a feature selection and a classification mechanism.
For the purpose of target classification, the features selected
do not necessarily have physical meaning. The only goal in
designing features is to preserve class discriminant information
of the data while ignoring information that is irrelevant
to the discrimination task. Once a feature is identified,
it will define a transformation to map input
measurements into feature space, which usually has a much
lower dimension than that of the input space. This will
greatly simplify the classification problem.
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A  number  of  attributes  that  are  present  in  ISAR  images 
can be exploited to discriminate between targets and clutter
false alarms. They are size, shape, signal strength, polarimetric
properties, spatial distribution of reflected signal,
and so on. However, only a few of them can be used to
discriminate among classes of targets. It is very difficult to
develop discrimination features for exploiting these attributes
in any optimal fashion. Furthermore, the features
identified may not be independent and it is not easy to find
the “right” fusion rule. In fact, past experimental results
showed that adding features does not necessarily improve
performance if they are not handled correctly. In this study,
in the first stage, we examined up to twelve features against
the data; some were studied before, and others are new.
Out of the twelve features, we then selected the best
discrimination features for classification. The main idea in this
stage is to select the most useful information or processed
results, while ignoring the irrelevant or bad ones. Although
the feature set used in the paper is computed from the data
of the same sensor, the concepts of feature selection and
decision making discussed here are not restricted to this
case only.
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In  the  second  stage,  a  Bayesian  network  is  used  for  clas¬ 
sification.  A  Bayesian  network  is  a  directed,  acyclic  graph 
in which the nodes represent random variables, and the arcs
between the nodes represent probabilistic dependence between
the variables. The Bayesian network model, when
constructed properly, can explicitly represent the conditional
independence and dependence between various features
and therefore provide a sound and natural framework
for feature fusion. Much attention has been drawn to this
technology in the past few years and it has been successfully
applied both to tasks of assessment under uncertainty
and tasks of decision-making under uncertainty. Recently
it has also been applied to multisource intelligence
fusion. In this paper, for feature-level fusion, we first identify
the network topology for various target parameters
such as class and orientation and sensor data features. Instead
of working directly on the input measurements, the
transformed data on the chosen feature spaces are used as
input. With the selected topology, we then learn (estimate)
the probabilistic relationship between various variables in
the Bayesian network. The conditional probability that a
target belongs to a class given observed features is then
computed based on a probabilistic inference algorithm using
the network. Finally, the target is assigned to the class
with the highest probability. Note that the idea proposed
here is general and can be applied to the multisensor domain
directly. In fact, the idea of applying a Bayesian network to
multisensor fusion has become more and more popular in
the fusion community.

This paper is organized as follows: Sec. 2 describes
target sets and radar image data. Sec. 3 introduces the network
algorithm and the classification process. Sec. 4 presents
the performance results, and finally, Sec. 5 contains
some concluding remarks.


2 Data Description

The database used for this study was obtained from MIT’s
Lincoln Laboratory; it consists of four representative targets:
a Dodge van, a Chevrolet Camaro, a Dodge pickup
truck, and an International Harvester bulldozer. Each of the
four targets was put on a platform, and the millimeter-wave
radar image data were collected using a 35-GHz normal frequency
at a fixed 5.5-deg depression angle while the platform
turned over a complete 360-deg azimuth. These inverse
synthetic aperture radar (ISAR) images are 1-ft.
range-processed with full polarizations, namely, horizontal
transmit, horizontal receive (HH); horizontal transmit, vertical
receive (HV); and vertical transmit, vertical receive
(VV). The images of these vehicles are available at 0.04-deg
azimuth intervals and each is associated with a viewing
angle. The original image of each vehicle has a size of
32×20 pixels. Figure 1 shows an example of four target
images using single-channel HH at a 0.4-deg azimuth.

Fig. 1 ISAR images of an HH channel for four targets at 0.4-deg azimuth
with a 1-ft × 1-ft resolution.

The fully polarimetric ISAR data were first filtered by a
polarimetric whitening filter (PWF), and then normalized
and compressed by a window slicing technique to a 15×9
dimension. In this study, 5280 processed images were
selected from the database for each target. It should be
mentioned that the targets look confusing from different
angles. In particular, an image of a target at one angle may
look like the image of another target at the same or a different
angle, while some images from adjacent angles for
the same target may look much different. This made the
classification task very difficult.


3 Feature-Based Classification Procedure

Our feature-based classifier is composed of the following
stages. As shown in Fig. 2, the observed measurement is
first transformed into feature space based on preselected
features. Then the transformed data are input into the Bayesian
network for probabilistic inference. The result is a set
of estimated conditional probabilities that the observed target
is from one of the classes given the observed features.
Finally, the decision-making procedure simply compares
these estimated conditional probabilities, and the observed
target is assigned to the class with the highest conditional
probability. Note that the feature selection and Bayesian
network model learning modules are based on the training
data and are done a priori off-line.

[Block diagram with blocks: Training Data, Feature Selection,
Bayesnet Model Learning, Testing Data, Feature Mapping,
Bayesian Network Inference, Decision]

Fig. 2 Feature-based classifier with a Bayesian network.

The most difficult part to build into this classifier is feature
selection. Once features are chosen, the second step is
to learn the probabilistic models between features and the
targets and represent the model by a Bayesian network. For
simplicity, we have assumed a simple two-level network
topology where the observed features are assumed to be
conditionally independent given the target class and image
azimuth angle* (see Fig. 3). Given the network topology,
the first task is to estimate the conditional probabilities of
the observed features given the target parameters. These
estimated conditional probability distributions are used to
describe the variable relationships in the Bayesian network.
In the rest of this section, we will describe feature selection,
Bayesian network modeling and decision making in detail.

*Note that the observed features are not conditionally independent given
only the target class.

Fig. 3 A probabilistic model between features and target states.

3.1  Feature  Selection 

As mentioned before, the useful features here are those that
preserve class discriminant information on the data while
ignoring information that is irrelevant to the discrimination
task. Since, at present, no method exists for developing
discrimination features in any optimal fashion, one way is
to test all the proposed features on the data set to see how
well they can separate the targets of different classes. The
good features, which have a better ability to separate different
target classes, are retained for later use in
classification, and the poor ones are discarded.

In this study, a total of twelve features were examined,
of which three (standard deviation, fractal dimension, and
weighted-rank fill ratio) were developed by MIT’s Lincoln
Laboratory to discriminate targets from natural-clutter false
alarms in their ATR system, and the rest are new features.
Some features are contrast-based, and the others are related
to spatial distribution. They are as follows.

(i) The standard deviation (SD) feature is a measure of
the fluctuation in intensity in an image. It is computed from
the typical estimator for the standard deviation by using the
power (expressed in decibels) of all the pixels in an image.
If the radar image in power is denoted by P(r,a), then the
log standard deviation σ can be estimated as:

$$\sigma = \sqrt{\frac{S_2 - S_1^2/N}{N-1}}, \tag{1}$$

where

$$S_1 = \sum_{r,a} 10\log_{10} P(r,a), \tag{2}$$

$$S_2 = \sum_{r,a} \left[10\log_{10} P(r,a)\right]^2. \tag{3}$$

N is the total number of pixels in the image, which is 9×15
= 135.

(ii) The fractal dimension (FD) feature provides a measure
of the spatial distribution of the brightest scatterers in
the images. To calculate the FD feature of an image, the
first step is to convert the image to a binary one. To do so,
the K brightest pixels in the image are selected and their
values are converted to 1, while the rest of the pixel values
are converted to 0. Then the fractal dimension can be estimated
as:

$$\dim = -\frac{\log M_1 - \log M_2}{\log 1 - \log 2} = \frac{\log M_1 - \log M_2}{\log 2}, \tag{4}$$

where M_1 is the number of 1-pixel-by-1-pixel boxes
needed to cover the image, and M_2 is the number of
2-pixel-by-2-pixel boxes needed to cover the image. Obviously,
M_1 = K.

(iii) The weighted-rank fill ratio (WRFR) feature measures
the percentage of the total energy contained in the
brightest scatterers of an image. Using the notation of Eq.
(2), this feature is defined as follows:

$$\mathrm{WRFR} = \frac{\sum_{\text{brightest pixels}} P(r,a)}{\sum_{\text{all pixels}} P(r,a)}. \tag{5}$$

(iv) The counting (CNT) feature is obtained by counting
the number of pixels in the image that exceed a specific
threshold, then dividing by the total number of pixels of the
image.

(v)-(xii) The following eight features are designed to
measure the spatial distribution of the brightest pixels in the
images. First we convert the image to binary by using amplitude
thresholding, in which all pixel values exceeding a
specified threshold are converted to 1, and the remaining
pixel values are converted to 0. Assuming the converted
binary image is denoted as B(i,j), then these features are
defined as:

$$M_X = \frac{1}{N}\sum_{i=1}^{9}\sum_{j=1}^{15} i\,B(i,j), \tag{6}$$

$$M_Y = \frac{1}{N}\sum_{i=1}^{9}\sum_{j=1}^{15} j\,B(i,j), \tag{7}$$

$$\sigma_{XX} = \frac{1}{N-1}\sum_{i=1}^{9}\sum_{j=1}^{15} (i-M_X)^2 \times B(i,j), \tag{8}$$

$$\sigma_{YY} = \frac{1}{N-1}\sum_{i=1}^{9}\sum_{j=1}^{15} (j-M_Y)^2 \times B(i,j), \tag{9}$$

$$\sigma_{XY} = \frac{1}{N-1}\sum_{i=1}^{9}\sum_{j=1}^{15} (i-M_X)(j-M_Y) \times B(i,j), \tag{10}$$

$$W_X = \max\{i : B(i,j)=1\} - \min\{i : B(i,j)=1\}, \tag{11}$$

$$W_Y = \max\{j : B(i,j)=1\} - \min\{j : B(i,j)=1\}, \tag{12}$$

(13)

where N is the total number of pixels in the image.
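
As a rough illustration of features (i) through (iv), the following sketch computes SD, FD, WRFR, and CNT for a small image stored as a list of rows of pixel powers. It is not the authors' code. The function names are hypothetical, and the 2-pixel-by-2-pixel box count assumes a fixed grid aligned to the image origin, which is only one way to count covering boxes.

```python
import math


def sd_feature(power):
    """Log standard deviation of the pixel power in dB, Eqs. (1)-(3)."""
    db = [10.0 * math.log10(p) for row in power for p in row]
    n = len(db)
    s1 = sum(db)
    s2 = sum(v * v for v in db)
    # max() guards against a tiny negative value caused by rounding.
    return math.sqrt(max(s2 - s1 * s1 / n, 0.0) / (n - 1))


def binarize_top_k(power, k):
    """Set the k brightest pixels to 1 and all other pixels to 0."""
    ranked = sorted(
        ((p, i, j) for i, row in enumerate(power) for j, p in enumerate(row)),
        reverse=True,
    )
    binary = [[0] * len(row) for row in power]
    for _, i, j in ranked[:k]:
        binary[i][j] = 1
    return binary


def fd_feature(power, k):
    """Fractal dimension estimate from 1x1 and 2x2 box counts, Eq. (4)."""
    binary = binarize_top_k(power, k)
    m1 = k  # each 1-valued pixel needs its own 1x1 box
    boxes = {
        (i // 2, j // 2)
        for i, row in enumerate(binary)
        for j, b in enumerate(row)
        if b
    }
    m2 = len(boxes)  # assumption: boxes lie on a fixed 2x2 grid
    return (math.log(m1) - math.log(m2)) / math.log(2.0)


def wrfr_feature(power, k):
    """Fraction of total power in the k brightest pixels, Eq. (5)."""
    flat = sorted((p for row in power for p in row), reverse=True)
    return sum(flat[:k]) / sum(flat)


def cnt_feature(power, threshold):
    """Fraction of pixels whose power exceeds the threshold."""
    flat = [p for row in power for p in row]
    return sum(1 for p in flat if p > threshold) / len(flat)
```

With this box-counting convention, four bright pixels packed into one 2x2 block give a dimension of 2, while four bright pixels that each fall in a different block give 0.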

Features need to be evaluated since using similar features
does not guarantee a better discrimination performance,
and sometimes adding features can even degrade
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performance.  One  way  to  evaluate  whether  a  feature  is 
good for discriminating targets of different classes is to test
it on the training data set in a heuristic manner. The other
way is to use the optimal feature set selection approach,
which uses a Bhattacharyya upper bound on the probability
of classification error to determine which subset of any
l features of the available L features has the lowest upper
bound. It can be seen in Sec. 4 that this latter approach is an
effective way to select features and that its performance is very
close to what we obtained using an “optimal manually chosen”
feature set.
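
The subset search can be illustrated with a simplified sketch. It is not the OLPARS procedure. It assumes, purely for illustration, that each feature is Gaussian within each class and independent of the other features, that the classes have equal priors, and that the multiclass error is bounded by the sum of pairwise Bhattacharyya bounds. All names and numbers are hypothetical.

```python
import math
from itertools import combinations


def bhattacharyya_1d(m1, v1, m2, v2):
    """Bhattacharyya distance between two univariate Gaussians."""
    return (0.25 * (m1 - m2) ** 2 / (v1 + v2)
            + 0.5 * math.log((v1 + v2) / (2.0 * math.sqrt(v1 * v2))))


def error_bound(stats, subset):
    """Sum of pairwise Bhattacharyya bounds on the classification error.

    stats[c][f] = (mean, variance) of feature f for class c.
    Independent features are assumed, so distances add over features.
    """
    classes = list(stats)
    prior = 1.0 / len(classes)  # assumption: equal class priors
    bound = 0.0
    for a, b in combinations(classes, 2):
        dist = sum(bhattacharyya_1d(*stats[a][f], *stats[b][f]) for f in subset)
        bound += math.sqrt(prior * prior) * math.exp(-dist)
    return bound


def best_subset(stats, features, size):
    """Exhaustively pick the subset of the given size with the lowest bound."""
    return min(combinations(features, size), key=lambda s: error_bound(stats, s))


# Hypothetical per-class (mean, variance) values, for illustration only.
stats = {
    "van":   {"SD": (0.0, 1.0), "FD": (1.0, 1.0), "WRFR": (0.0, 1.0)},
    "truck": {"SD": (5.0, 1.0), "FD": (1.0, 1.0), "WRFR": (1.0, 1.0)},
}
chosen = best_subset(stats, ["SD", "FD", "WRFR"], 2)
```

Here the feature with identical class statistics contributes nothing to the bound, so the search keeps the other two.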

3.2 Probabilistic Modeling of a Bayesian Network

Our objective is to recognize targets from the observed
data. Bayesian networks show great promise for performing
this function since they can be used to represent complicated
probabilistic relationships among variables of interest.
Furthermore, many efficient algorithms have been developed
for drawing inferences from the evidence. For
the current Bayesian network model, let the class status of
an observed target be a random variable X, and assume the
target belongs to one and only one of K classes; then
X is a discrete random variable, and without loss of
generality, we can assume it takes on a value 1,2,...,K (in
our case K=4). In a radar image, an important factor that
greatly affects the appearance of the target is the target
orientation. Let Θ denote the azimuth angle at which a target
is imaged, which can have values from 0 to 360 deg. If we
discretize the 360-deg space into M small sectors of
equal width, Θ can be treated as a discrete random
variable. Finally, based on a set of selected features, denoted
as random variables F_1, F_2, ..., F_L, discrete or continuous,
a simple two-level probabilistic model between
features and target states (class and azimuth) can be obtained
as in Fig. 3.
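
The discretization of the azimuth angle can be sketched as follows. This is a minimal illustration with a hypothetical function name. It assumes sectors of equal width numbered 1 to M, with sector 1 starting at 0 deg.

```python
def azimuth_sector(theta_deg, num_sectors):
    """Map an azimuth angle in degrees to a sector index in 1..num_sectors."""
    width = 360.0 / num_sectors
    index = int((theta_deg % 360.0) // width)
    # min() guards against a rounding edge case at the 360-deg wrap.
    return min(index, num_sectors - 1) + 1
```

With M=48, for example, each sector spans 7.5 deg, so an image taken at 0.4-deg azimuth falls in sector 1.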

As shown in Fig. 3, each node represents a random
variable, and each line indicates the conditional probabilistic
relationship between the connected nodes. Note here
that the network topology implicitly assumes that the observed
features are conditionally independent given the target
class and orientation. However, in general, this is not
the case, and a more complicated network topology is
needed to model the problem. In a Bayesian network, the
conditional probability distribution of a child given all of
its parents is assumed to be given before any probabilistic
reasoning can be done. In our case, this is to say that
given target type X and radar azimuth angle Θ, the feature
F_l is distributed with the known distribution P(F_l|X,Θ),
l=1,2,...,L. In reality, the conditional distributions
P(F_l|X,Θ) need to be elicited from expert knowledge or
physical models or estimated from the training data. It
should be mentioned that for some continuously distributed
features, since their distributions are not close to any of
the well-known parametric probability distributions,
they must be estimated with nonparametric methods. We
will discuss this in more detail in the next section.

With the Bayesian network, the class probabilities of
observed targets can be computed with any probabilistic
inference algorithm. Basically, the question becomes
one of how to calculate the conditional probability that an
observed target is from a class at an azimuth angle given
the observed features, i.e., P(X,Θ|F_1,F_2,...,F_L). For the
current simplified model, since the features F_1,F_2,...,F_L
are conditionally independent given X and Θ, it can be
shown that the required conditional probability can be obtained
as:†

$$P(X,\Theta \mid F_1,F_2,\ldots,F_L) = \frac{1}{C}\prod_{l=1}^{L} P(F_l \mid X,\Theta), \tag{14}$$

where C is a normalizing constant. If there are K classes of
targets and M sectors of angles, based on Eq. (14), we can
obtain a total of K×M probability estimates. These are
then used to make a decision.
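
Eq. (14) amounts to multiplying the feature likelihoods for each (class, sector) pair and normalizing over all pairs. The sketch below is only an illustration with hypothetical names and made-up likelihood values. Like Eq. (14) as written, it applies no explicit prior over class and sector.

```python
import math


def joint_posterior(likelihoods):
    """Normalize products of feature likelihoods over all (class, sector) pairs.

    likelihoods[(k, m)] is the list of P(F_l | X=k, Theta=m) values
    evaluated at the observed features, one entry per feature.
    """
    unnormalized = {state: math.prod(values) for state, values in likelihoods.items()}
    c = sum(unnormalized.values())  # the normalizing constant C of Eq. (14)
    return {state: p / c for state, p in unnormalized.items()}


# Two classes, two sectors, two features (illustrative numbers only).
likelihoods = {
    (1, 1): [0.5, 0.4],
    (1, 2): [0.1, 0.1],
    (2, 1): [0.2, 0.5],
    (2, 2): [0.3, 0.3],
}
posterior = joint_posterior(likelihoods)
```

The result holds K×M values that sum to one, which is the set of estimates passed to the decision step.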


3.3  Decision  Making 

In this step, the observed target is assigned to an appropriate
class. What we obtained from Eq. (14) is a set of K×M
probabilities, i.e., how likely an observed target is to be from
class k and at azimuth sector m, k=1,2,...,K, and
m=1,2,...,M. To make a decision, there are two basic
decision rules.

Decision rule 1. Among the K×M estimates, find the one
with the highest probability value, then assign the target to
the class associated with this estimate. It can be seen that,
at the time the target class is determined, so is the target
azimuth angle. However, this may not be necessary, and
hence we have the second decision rule.

Decision rule 2. Instead of estimating
P(X,Θ|F_1,F_2,...,F_L) using Eq. (14), we estimate
P(X|F_1,F_2,...,F_L). This can be obtained by the following
equation:

$$P(X \mid F_1,F_2,\ldots,F_L) = \sum_{\Theta} P(X,\Theta \mid F_1,F_2,\ldots,F_L) = \frac{1}{C}\sum_{\Theta}\prod_{l=1}^{L} P(F_l \mid X,\Theta). \tag{15}$$

From Eq. (15), we can obtain K probability estimates.
Again, we choose the one with the highest value, then assign
the target to the class associated with this estimate.

When making a probabilistic inference, an interesting
consideration is to treat the radar azimuth angle as known.
This may happen if the radar azimuth angle at which the
target is imaged can be determined by other sources of
information. If this is the case, namely the radar azimuth
angle is known to be in the m’th sector, the conditional
probability that the target is from a specific class given the
observed features and Θ=m can be obtained using the following
equation:

$$P(X \mid F_1,F_2,\ldots,F_L,\Theta=m) = \frac{1}{C}\prod_{l=1}^{L} P(F_l \mid X,\Theta=m), \tag{16}$$

where C is a normalization constant. Again, the decision
making is based on K calculated probability estimates.
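
The two decision rules and the known-azimuth case can be compared on a small example. The sketch below is illustrative only. It assumes the joint posterior over (class, sector) pairs is already available as a dictionary, and the function names and numbers are hypothetical.

```python
def decision_rule_1(posterior):
    """Pick the (class, sector) pair with the highest joint probability."""
    return max(posterior, key=posterior.get)


def decision_rule_2(posterior):
    """Sum the joint probabilities over sectors, then pick the best class."""
    marginal = {}
    for (k, _m), p in posterior.items():
        marginal[k] = marginal.get(k, 0.0) + p
    return max(marginal, key=marginal.get)


def decide_known_sector(posterior, m):
    """Keep only the known sector m, renormalize, and pick the best class."""
    restricted = {k: p for (k, mm), p in posterior.items() if mm == m}
    total = sum(restricted.values())
    normalized = {k: p / total for k, p in restricted.items()}
    return max(normalized, key=normalized.get)


# Illustrative joint posterior for two classes and two sectors.
posterior = {(1, 1): 0.4, (1, 2): 0.0, (2, 1): 0.3, (2, 2): 0.3}
```

In this example rule 1 selects class 1 (at sector 1), while rule 2 selects class 2 because its probability mass of 0.6 is spread over two sectors. The two rules therefore need not agree.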


†Again, in general, the network is more complicated and an efficient
algorithm such as SPI [16] can be used to compute the conditional probability
distributions.
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Table  1  Averaged  correct  classification  rates  (ACCR)  in  %  using  OLPARS. 


Optimal                  Nearest mean classifier    Fisher pairwise classifier
feature set     weight   training    testing        training          testing

all twelve      E        39.7        25.0           86.2 (rej. 20)    68.3 (rej. 830)
features        var.     66.8        66.2
                cov.     80.2        64.2

1,3,6,7,12      E        38.3        25.0           83.8 (rej. 20)    72.4 (rej. 44)
                var.     66.0        67.9
                cov.     81.5        76.0

1,3,6,7         E        71.7        71.0           81.9 (rej. 17)    82.0 (rej. 49)
                var.     66.9        67.1
                cov.     80.4        80.2

1,3,6           E        69.1        67.8           78.4 (rej. 26)    78.4 (rej. 26)
                var.     69.0        68.5
                cov.     76.7        76.9

1,3             E        63.1        62.9           72.8 (rej. 59)    72.2 (rej. 141)
                var.     65.0        65.0
                cov.     72.4        72.1


4  Performance  Evaluation 

The data set used for this study is first randomly and uniformly
split into two data sets, the training data set
(1728×4) and the testing data set (3552×4). Using the
training data set, the conditional probability distributions
P(F_l|X,Θ) are first estimated. They are then used to define
the Bayesian network and later to calculate the conditional
probabilities P(X,Θ|F_1,F_2,...,F_L) for classification. The
testing data set is input to the system to examine the classification
performance, and the results are reported in terms
of the averaged correct classification rate (ACCR), which is
defined as the ratio of the number of correctly classified
observations to the total number of observations in the testing
data set.
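
The split and the ACCR measure can be sketched as follows. This is only an illustration with hypothetical function names. It assumes the 5280 images of one target are indexed 0 to 5279, and the seed and labels are made up.

```python
import random


def split_indices(n_total, n_train, seed=0):
    """Randomly split the indices 0..n_total-1 into training and testing sets."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]


def accr(true_labels, predicted_labels):
    """Ratio of correctly classified observations to all observations."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)


# Per-target split used in this study: 1728 training and 3552 testing images.
train, test = split_indices(5280, 1728)
rate = accr(["van", "camaro", "truck", "bulldozer"],
            ["van", "camaro", "truck", "van"])
```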

4.1  Estimation  of  Conditional  Probability 
Distributions 

In this approach, the conditional probability distributions
are estimated by a smoothing kernel approach, which is the
most thoroughly developed approach in the literature. Assuming
we have K classes of targets (K=4 in our problem), L
features, and the angle space is decomposed into M sectors,
there will be a total of K×L×M distribution functions to
be estimated. For the l’th feature F_l, k’th class, and m’th
sector of angles, P(F_l|X=k,Θ=m) is estimated by using
only the image data from the k’th class of target and m’th
sector of angles. The data are first transformed based on the
feature F_l. If we consider F_l^1, F_l^2, ..., F_l^n as n transformed
observations based on the feature F_l, and they are from the
k’th class and m’th sector of angles, the kernel estimate has
the form

$$\hat{P}(F_l \mid X=k,\Theta=m) = \frac{1}{n}\sum_{i=1}^{n} k\left(F_l - F_l^{i}\right), \tag{17}$$

where k(·) is a probability density function symmetric
about the origin. Here we use a Gaussian density with variance
σ², which is the only parameter. The parameter should
be chosen so that the ACCR is maximized.
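
The kernel estimate of Eq. (17) with a Gaussian kernel can be sketched in a few lines. The code below is an illustration only, with hypothetical names and made-up observations. It evaluates the density of one feature for one class and sector.

```python
import math


def gaussian_kernel(u, sigma):
    """Zero-mean Gaussian density with standard deviation sigma."""
    return math.exp(-0.5 * (u / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def kernel_estimate(x, observations, sigma):
    """Average of kernels centered on the training observations, Eq. (17)."""
    return sum(gaussian_kernel(x - xi, sigma) for xi in observations) / len(observations)


# Hypothetical feature values from one class and one sector of angles.
observations = [0.0, 2.0]
density_at_1 = kernel_estimate(1.0, observations, 0.5)
```

A small sigma yields a spiky estimate that follows individual training samples, while a large sigma smooths the classes together, which is why the parameter is tuned against the ACCR.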


4.2  Classification  Results 

In this section, for the purpose of comparison, we first
present some test results obtained with traditional classifiers:
nearest mean and Fisher pairwise. We then show some test
results to illustrate the key issues discussed in previous sections.
The first part of the results (Table 1) is obtained by
means of a software package called OLPARS (On-Line
Pattern Analysis and Recognition System). The system
provides users with a convenient tool to realize a variety of
traditional pattern recognition methods, especially optimal
feature set selection, as mentioned in the previous section.
Traditional classifiers such as nearest mean and Fisher
pairwise can also be designed and evaluated easily.

In Table 1, the left-most column refers to the optimal
feature set selected in terms of the Bhattacharyya upper
bound. The number of features in the feature set must be
given in order to do the feature selection. For example, if
we decide to use only two features, OLPARS reports that
the best feature set is {1,3}, i.e., the first feature SD and the
third feature WRFR as defined in Sec. 3.1. Corresponding
to each of the optimal feature sets, Table 1 presents the
classification results using the nearest mean and Fisher
pairwise classifiers for both the training and testing data
sets (note: the classifier parameters are estimated from the
training data set only). In the table, "E," "var," and "cov."
represent the different distance measures: Euclidean,
weighted by a diagonal variance matrix, and weighted by a
covariance matrix, respectively, and "rej." refers to rejection.
As can be seen from the table, the feature set {1,3,6,7}
has the overall best performance.
Fig. 4 ACCR vs. $1/\sigma$ for {1,3,6,11}, {1,3,6,7}, and {1,2,3}, $M=48$, and
using decision rule 2.


The second part of the results is obtained using feature-based
classification with a Bayesian network. Figure 4
shows a comparison of classification results among the optimal
feature set {1,3,6,7} selected by OLPARS, the
manually selected set {1,3,6,11}, and the previously proposed
feature set {1,2,3}. The best ACCRs are 87.55% for
{1,3,6,7} and 88.00% for {1,3,6,11}, respectively, so the
latter feature set performs slightly better than the former.
The two feature sets selected by the two different approaches
differ in only one feature. However, the optimal
feature selection approach is more computationally
efficient. In Fig. 4, it can also be seen that the overall
performance of feature sets {1,3,6,7} and {1,3,6,11} is much
better than that of the previously proposed set {1,2,3}.

Figure 5 gives a comparison between decision rules 1
and 2. We find the second rule is better than the first. However,
using the first rule, the target azimuth angles can be
predicted simultaneously.

Figure  6  displays  the  impact  of  the  number  M  when  the 
azimuth  angle  space  is  partitioned.  It  can  be  seen  that  when 
M  is  increased,  performance  improves  as  well. 

The last result, shown in Fig. 7, is obtained using Eq. 16,
assuming the radar azimuth angle can be determined from
other information sources. In this model, the ACCR is
about 5% higher than that of decision rules 1 and 2.
The best ACCR is 93.3%. This result is not surprising
because more information is assumed to be available in this
model.

By  comparing  the  two  experimental  results,  it  can  be 
seen  that  the  feature-based  Bayesian  net  classifier  performs 
noticeably  better  than  the  traditional  classifiers — nearest 
mean  and  Fisher  pairwise. 


Fig. 5 ACCR vs. $1/\sigma$ for decision rules 1 and 2, $M=48$, and using
{1,3,6,11}.


Fig. 6 The impact of $M$, using decision rule 2 and {1,3,6,11}.


5  Conclusions 

In this paper, we applied Bayesian network technology
to feature-level fusion. Bayesian networks show great
promise for multiattribute fusion since they can be used to
represent complicated probabilistic relationships among
variables of interest. We first identify the network topology
for various target parameters such as class and orientation
and sensor data features. The network topology explicitly
represents the conditionally independent or dependent
relationships among various features. Instead of working
directly on the input measurements, the transformed data on
the chosen feature spaces are used as input. With the selected
topology, we then learn (estimate) the probabilistic
relationship among various variables in the Bayesian network.
The conditional probability that a target belongs to a
class given observed features is then computed based on a
probabilistic inference algorithm using the network. The
network model used here is relatively simple. In separate
but related research, we also studied the problem of Bayesian
network construction using neural learning techniques
where the network is more general and complicated.

In the current approach, performance depends on a number
of factors, including selection of a set of workable features,
choosing an appropriate probabilistic model to describe
the observations, handling approximation of the
required conditional probability distributions, and so on.
When classifying the SAR image, not only does the Bayesian
network model lead to better performance than certain
types of traditional classifiers, for example, nearest
mean and Fisher pairwise, it also possesses a certain degree
of flexibility to handle other target parameters such as orientation.
The orientation can be predicted at the time the
target is classified. In the case where the orientation is
known, the classification accuracy is improved significantly.
The decomposition of azimuth angle space and the
parameter of the conditional probability distribution estimates
are two main factors that could be adjusted to improve
performance.

Fig. 7 ACCR vs. $1/\sigma$, assuming $\Theta$ is known, using {1,3,6,11}, and
$M=48$.

Although the feature set used in the example is obtained
from the same sensor, the concepts of feature selection and
decision making discussed in the paper are not restricted to
this case. The idea of applying a Bayesian network to
feature-level fusion can also be applied to a multisensor
domain directly. In fact, the idea of applying a Bayesian
network to multisensor fusion has become more and more
popular in the fusion community. The evaluation of our
results demonstrates the usefulness of the proposed approach.
ABSTRACT

A hybrid automatic target recognition system is presented that exploits advances in two new fields in
detection theory and signal analysis. The first is in the area of Universal Classification, which offers asymptotically
optimal solutions to non-Gaussian properties of signals, and the second is in the field of multi-resolution
analysis (MRA), which uses the automatic feature isolating properties of the wavelet transform. The Universal
Classifier is used as the first stage of a hybrid ATR system that efficiently sifts through large quantities of
imagery, locating regions of interest that contain "target-like" features. The target chips of interest are then
passed through the MRA to be classified at the final stage. Wavelets are well suited to the study of
unpredictable signals with both low frequency components and sharp transitions. As a result, there has been
recent interest in applying this new signal processing field to the target recognition problem. But few have
combined the natural feature extraction capability of time-frequency methods in the classification stage. In
this approach, we utilize the sub-space "crystals" from a specific decomposition and operate a classification
strategy against each crystal of the transform. The complete ATR system is presented as well as performance
examples using both real synthetic aperture radar (SAR) data and data generated using the Xpatch signature
prediction code.

Keywords: Synthetic Aperture Radar, Automatic Target Recognition, Universal Classification, Wavelet
Multiresolution Analysis


1. INTRODUCTION

The recognition of targets in SAR imagery is an ongoing, challenging research topic
that is important for military applications. Some of the best performing algorithms reported in government
and industry are related to simple template matching algorithms. This is surprising, since theoretically we
know that a matched filter is optimal only under Gaussian class descriptions. One reason for such success is
the efficient use of spatial information that is ignored by many applications. But one is led to believe that
improvements in performance can be made.
Figure 1. Hybrid SAR ATR System
In this paper we introduce a unique hybrid ATR system (Figure 1). The first stage exploits recent
developments in the theory of Universal Classification. Prescreening SAR data requires the ability to
examine large amounts of data without imposing numerous assumptions on the environment. Universal
classification has been shown to be asymptotically optimum (in the amount of data) for classifying extremely
general forms of data. The second stage transforms the regions of interest (ROI) from the pre-screener using
a multi-resolution analysis (MRA) for the third and final classification stage. The classifier is a Gram-Schmidt
image classifier (5) which discriminates correlation properties of the transformed signal. The
organization of the paper is as follows. In Section 2, we introduce the pre-screener and demonstrate it on strip
map SAR data. In Section 3, we discuss the motivation and approach behind the wavelet decomposition. In
Section 4 we present the classifier and discuss some preliminary results using an Xpatch synthetic target data
set, and finally we summarize in Section 5.

2. UNIVERSAL CLASSIFIER PRE-SCREENER

The general theory underpinning this new branch of classification has been used in universal data
compression to develop the Lempel-Ziv (LZ) compression algorithm (compress on UNIX machines). This
highly robust algorithm has had an overwhelming impact on the field of data compression and storage. It is
believed that the classification extensions of these approaches will have a similar practical impact on
classification, and in particular on SAR classification.

The basis for this class of algorithms is the utilization of statistical distance measures between the observed
data and training data. Of course, this is similar to correlation based methods in Gaussian environments, but
the universal methods incorporate the optimum non-linearities when the environment deviates from this
simplistic assumption.

We define the classification problem as follows:

$$H_1: X \sim S_1(X)$$

$$H_2: X \sim S_2(X)$$

$$H_3: X \sim S_3(X)$$


where we assume that the class densities are unknown. In the absence of a precise statistical model for the
classes, we assume the existence of a sequence of training data from the source. We proceed by quantizing
and forming the types (empirical density estimates).

Mathematically, this problem can be modeled as an M-ary hypothesis testing problem. Our Universal
Classifier implementation is a generalization from Cover (6); its decision statistic $h(x)$ is defined in terms of
the Kullback-Leibler distance between the types $P_x$ and $P_y$.
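
To make the idea of types and the Kullback-Leibler distance concrete, the sketch below forms empirical distributions from quantized samples and assigns a test window to the class whose training type is nearest. This nearest-type rule is a simplification chosen for illustration and is not the generalized decision statistic used in the paper; the function names, the four-level quantization, and the sample values are all hypothetical.

```python
import math
from collections import Counter


def type_of(samples, alphabet):
    """Empirical distribution (type) of quantized samples over a fixed alphabet."""
    counts = Counter(samples)
    n = len(samples)
    return [counts[a] / n for a in alphabet]


def kl_distance(p, q, eps=1e-9):
    """Kullback-Leibler distance D(p || q); eps guards against empty bins in q."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def classify(test_samples, training_sets, alphabet):
    """Pick the class whose training type is closest to the test type."""
    test_type = type_of(test_samples, alphabet)
    distances = {name: kl_distance(test_type, type_of(train, alphabet))
                 for name, train in training_sets.items()}
    return min(distances, key=distances.get)


alphabet = [0, 1, 2, 3]  # hypothetical quantization levels
training_sets = {
    "grass": [0, 0, 1, 0, 1, 0, 0, 2, 1, 0],
    "trees": [2, 3, 3, 2, 1, 3, 2, 3, 3, 2],
}
label = classify([0, 1, 0, 0, 2, 0, 1, 0, 0], training_sets, alphabet)
```

Larger test windows give more samples per type, so the empirical distributions become more reliable, which matches the intuition that performance improves with the amount of data.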

To demonstrate, we apply the classifier to the discrimination between two clutter types (grass and trees) in
Lincoln Laboratory ADTS 1-foot resolution SAR strip-map data. Because the classifier is asymptotically
optimal, the more information that is available, either through higher resolution imagery or, as in our example,
by exploiting the fully polarimetric images, the closer the error rate goes to zero (Figure 2). The first row shows the
output decision statistics from testing on several hundred square kilometers of clutter data for varying ROI
operators: 3x3, 5x5, 20x20. The second row shows the respective receiver operating characteristics. The last column
applies a Bayesian fusion rule across four^ polarizations that outperforms even the larger windowed 20x20
discriminator. The straight line ROC indicates no errors given the sample size.
Figure 2. Universal Classification Performance using increasing
spatial and polarization information


A Constant False Alarm Rate (CFAR) detector, which is the typical pre-screening method, uses a
single pixel test that operates no better than the first column case. Even though our experiment was
performed on two classes of clutter, target classes have even stronger energy returns and would converge (to
zero error) at a faster rate. If a target, however, is embedded in clutter, for example trees, there exists no
algorithm that can extract a target where no signature information exists. But this method, nevertheless,
would do very well against partially obstructed or occluded targets. The performance would degrade in
proportion to the loss of signature. This algorithm, in its present form, is not designed to discriminate
between targets, but extensions exist. One such method has been implemented by Warke and Orsak and
has achieved very high classification rates on the face recognition problem.


3. WAVELET DECOMPOSITION

In this section we investigate the second stage of our ATR system. What can wavelets contribute? By
definition, wavelets should offer some potential because of their properties of providing good localization in
both the spatial and frequency domains. There have been a number of recent attempts to exploit such benefits
in the pre-screener or detection stages of the ATR problem and a few attempts to tackle the challenging
later stages of an ATR process. This paper is strictly an attempt to exploit the attractive properties of
wavelets; the authors are neither wavelet experts nor crusaders. Wavelets have generated tremendous
interest in both theoretical and applied areas. This is due to the interesting properties of a wavelet
representation in terms of scale and location. In mathematics and engineering it has been known for some
time that techniques based on Fourier series and Fourier transforms are not quite adequate for many
problems. One of the assumptions is that the original time-domain function is periodic. As a result, the DFT
has difficulty with functions that have transient components, i.e., components which are localized in time.
This is particularly true of images that have frequent texture transitions. The other problem with


^ Analysis indicated that for this data, the cross polarization pairs were not redundant.


[...] the transform does not convey any information pertaining to [...] of the
[...] coefficient. The "compact support" of the wavelet
[...] of the energy [...]
their scaling constant into disjoint subsets spanning orthogonal subspaces [...]
at different scales can be [...] represented [...] corresponding [...]


The discrete wavelet transform of a function $f$ with respect to $\psi$ is


$$(W_\psi f)(m,n) = a_0^{-m/2} \int f(t)\, \psi^{*}\left(a_0^{-m} t - n b_0\right) dt \qquad (1)$$


where $f, \psi \in L^2(\mathbb{R})$, and $\mathbb{R}$ is the set of real numbers. In general for the DWT $a = a_0^m$, where $m$ is an
integer and $a_0 \neq 1$. The most common choice is to set $a_0 = 2$. For the translation parameter,
$b = n b_0 a_0^m$ where $b_0 > 0$. $\psi$ is called the mother wavelet and is assumed to be admissible for all
functions $f \in L^2(\mathbb{R})$ if there exist real numbers $A > 0$ and $B < \infty$ such that

$$A\,\|f\|^2 \le \sum_{m,n} \left|\langle f, \psi_{m,n}\rangle\right|^2 \le B\,\|f\|^2,$$

where $\psi_{m,n}(t) = a_0^{-m/2}\,\psi(a_0^{-m} t - n b_0)$ denotes the dilated and translated mother wavelet.


With the surprising and fortunate recent discovery of many such $\psi$ functions with $A=B=1$ by Mallat,
Daubechies, and others, the discrete wavelet transform offers an orthonormal basis in $L^2(\mathbb{R})$. In
practice we use a discretized version of (1) and implement it using a 2-dimensional version of the "pyramid
scheme" described by Mallat. This is a fast $O(N)$ algorithm for an $N$-pixel image. For our [...] we
investigated the haar, daublet, symmlet, and coiflet. Each mother wavelet is depicted in Figure 3. We found that
a coiflet with six taps using a three scale decomposition (decomposition grid in Figure 4) provided us with
[...] orthogonality, compact support, near symmetry, and vanishing moments. Even though we
have not rigorously compared the performance of the wavelet filter options yet*, we suspect that any choice
[...] similar [...] of the haar). This is a reasonable assertion based on their
properties and the natural automatic feature isolating properties of the transform. We also [...] a
[...] the signal at the boundary and then periodically extends it
using the algorithm given by Brislawn. This is not necessary for the orthogonal wavelets, but helped
prevent the non-symmetrical ones from drifting in phase.
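
The pyramid scheme can be illustrated in one dimension. The sketch below is a minimal example using the Haar filter pair, chosen only because it is the simplest orthonormal wavelet and not because it is the filter used in the experiments; a 2-dimensional version would apply the same step along rows and then columns. The function names and the test signal are hypothetical.

```python
import math


def haar_step(signal):
    """One pyramid level: scaled pairwise sums (smooth) and differences (detail)."""
    s = 1.0 / math.sqrt(2.0)
    smooth = [s * (signal[i] + signal[i + 1]) for i in range(0, len(signal), 2)]
    detail = [s * (signal[i] - signal[i + 1]) for i in range(0, len(signal), 2)]
    return smooth, detail


def haar_pyramid(signal, levels):
    """Repeatedly split the smooth part; the length must be divisible by 2**levels."""
    details = []
    smooth = list(signal)
    for _ in range(levels):
        smooth, detail = haar_step(smooth)
        details.append(detail)
    return smooth, details


# A flat signal with one sharp transition; three scales as in the text.
signal = [4.0, 4.0, 4.0, 4.0, 9.0, 1.0, 4.0, 4.0]
smooth, details = haar_pyramid(signal, levels=3)
energy_in = sum(v * v for v in signal)
energy_out = sum(v * v for v in smooth) + sum(v * v for d in details for v in d)
```

Each level halves the number of coefficients it processes, so the total work is proportional to the signal length. The only nonzero finest-scale detail coefficient sits at the position of the transition, which is the localization property the classifier relies on, and the orthonormal filters preserve the signal energy.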


* No formal quantitative analysis has been performed as of yet.
Figure 3. Investigated mother wavelets.


Figure 4. Wavelet scale decomposition


4. GRAM-SCHMIDT IMAGE CLASSIFIER

To exploit this information we implement our version of a classical correlation classifier using a
signal subspace technique popularized in communications to get target features, where each signal is defined
by a separate basis function in an orthonormal set. If this set can be described, then the optimum classifier
easily follows, and hence the minimum probability of error. One method of determining a set of orthonormal
basis functions from a signal set is the Gram-Schmidt procedure.

Starting with a target set having an angular coverage of q degrees, we use a subset for designing the
classifier. The classifier is designed by processing a target set through a Gram-Schmidt analysis to determine
a subset that best accounts for the characteristics of all the images within the target space.


Let the design images be denoted by vectors $X_1, X_2, \ldots$. The residuals of each image are
combined into a matrix formed from the vectors. The decomposition process requires that we
choose a subset of images that provides the best representation of the target set. We refer to this
subset of images as the Gram-Schmidt (GS) set. The GS algorithm is described in many linear
algebra textbooks. In our case, an orthonormal set $Q$ is produced by the algorithm from the
residuals to make up the components of the newly reconstructed design set, such that each
reconstructed design image is a sum of $N$ components, where $N$ is the rank of $Q$.

A subset consisting of $N$ images per target is selected for the classifier design. The classifier operates on the
unknown data using the GS-set as follows. A test image $X_t$ is projected onto the $N$-dimensional space
spanned by the GS-set.

The components of the projection, $\Theta_i$ for $i = 1, \ldots, N$, are calculated from the residual test image.
This test will determine if there is enough energy contained in the test
image to fall within one of the target classes. The scalar components are placed into an $N$-dimensional
feature vector

$$\Theta = [\Theta_1, \Theta_2, \ldots, \Theta_N]$$

The $\max(\Theta_i)$ for $i = 1, \ldots, N$ is the distance decision measure corresponding to the best projection onto the
GS-set. Hence, if the GS-set completely characterizes the target set, then all of the target images should
project onto one of the $N$ vectors completely.
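
The design and decision steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: images are represented as short lists of numbers, the test vector is normalized to unit length, and the absolute value of each projection is used as the score. All names and the toy vectors are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt(vectors, tol=1e-10):
    """Return an orthonormal list spanning the inputs (dependent vectors are dropped)."""
    basis = []
    for v in vectors:
        residual = list(v)
        for q in basis:
            c = dot(residual, q)
            residual = [r - c * qi for r, qi in zip(residual, q)]
        norm = math.sqrt(dot(residual, residual))
        if norm > tol:
            basis.append([r / norm for r in residual])
    return basis


def best_projection(test, basis):
    """Largest absolute projection of the unit-normalized test vector onto the basis."""
    norm = math.sqrt(dot(test, test))
    unit = [t / norm for t in test]
    return max(abs(dot(unit, q)) for q in basis)


def classify(test, class_bases):
    """Assign the class whose GS-set gives the largest projection."""
    scores = {name: best_projection(test, b) for name, b in class_bases.items()}
    return max(scores, key=scores.get), scores


# Toy four-pixel "images" standing in for transformed design chips of two classes.
class_bases = {
    "T1": gram_schmidt([[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]),
    "M1": gram_schmidt([[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 1.0, 1.0]]),
}
label, scores = classify([0.9, 0.1, 0.0, 0.05], class_bases)
```

A score close to 1 means the test image lies almost entirely along one vector of that class's GS-set, while a small score for every class would suggest rejecting the chip.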


4.1  EXPERIMENTAL  RESULTS 


For our experimental analysis of the discrete wavelet transform Gram-Schmidt classifier (DWTGS) we use a
large data base provided by the Model Based Vision Laboratory of Wright Patterson Air Force Base. This
data base was produced as an initial test set for the Moving and Stationary Target Recognition (MSTAR)
program. Since it contains four target classes we'll refer to it as the MSTAR-4 data. We formed images
using 141 phase histories in angle to obtain one foot cross range resolution and 101 frequencies
(approximately 500 MHz bandwidth) to produce one foot range resolution. No padding or use of powers of
two is needed (with 128 frequencies, range resolution would be 0.78 feet; with 50% oversampling, each
sample is now 0.66 feet, but resolution is still one foot). We use a 2D Hanning window for 30 dB sidelobe
suppression. Our initial tests use two of the military objects contained in the MSTAR-4 data. We'll refer to
these objects as T1 and M1. We restrict ourselves to 180 degrees in aspect and use a constant elevation angle
of 30°. Using a GS set of 10 spanning 180 degrees (basically every 18°) we formed our filter set for each of
the two classes. Using independent data we tested this against 180 test images spanning the same aspect
coverage. The confusion matrix results are:
classifier. Successful experiments implementing these approaches were demonstrated using both real and
synthetic SAR data.
ABSTRACT

In recent years, several methods have been developed for
learning Bayesian networks from a given database. Some of
these algorithms, due to their inherent nature, are
computationally intensive, and others, which employ a
greedy search heuristic, cannot guarantee to obtain an I-map
(independency map) of the underlying distribution of the
data, even if the sample size is sufficiently large. The focus
of this paper is on developing efficient methods for learning
Bayesian networks. A number of attributes of a Bayesian
network and learning metric are identified. Based on these
properties, new learning algorithms are developed which can
be shown to be computationally efficient and to guarantee that the
resulting network converges to a minimal I-map given a
sufficiently large sample size.

1.  INTRODUCTION 

In recent years, research on Bayesian network technology has
mainly focused on two directions: Bayesian network
inference algorithms, and structure construction and refinement.
For structure construction, traditional AI researchers
used expert knowledge to construct Bayesian networks.
More recently, AI researchers and statisticians have begun to
develop new methods for learning these networks. These
methods combine prior knowledge with data to produce one
or more Bayesian networks. The resulting networks can be
used for inference or, in special cases, to infer causal
relationships among variables. A Bayesian network structure,
when associated with a set of conditional probability
distributions, uniquely defines a joint probability distribution
(jpdf) on the $n$ domain variables. However, a given jpdf may
have as many as $n!$ (minimal) Bayesian network
representations. Hence, finding the simplest network
representation (a network which has the least number of
arcs) among these minimal network representations is
considered NP-hard.

When learning a Bayesian network from data,
computational complexity is a major concern. There are a
number of learning algorithms relying on a so-called
conditional independence test. For example, in Srinivas's
method [7], the recursive algorithm used for building a
sparse Bayesian network of $k+1$ variables based on the
existing network $K$ of $k$ variables is to find, for the $(k+1)$th
variable $x$, a minimal subset $Z \subseteq K$ such that $x$ is
conditionally independent of $K-Z$ given $Z$. One way to do
this is by generating all possible subsets of $K$ in increasing
order of size until $Z$ is found. Such an exhaustive search requires
up to $2^k$ independence checks when adding the $(k+1)$th node. The
total number of independence checks is approximately $O(2^n)$.
When the number of domain variables is large, the algorithm
is computationally intractable. Another learning algorithm,
CONSTRUCTOR, developed by M. Fung et al. [4], suffers
from a similar problem.
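
The exhaustive search described above can be sketched as follows. The independence test itself is left as a caller-supplied function, since the text does not specify it; the function names and the toy oracles are hypothetical and serve only to show how the number of checks grows.

```python
from itertools import combinations


def minimal_separating_subset(x, existing, is_independent):
    """Try subsets Z of `existing` in increasing size and return the first Z for
    which x is independent of (existing - Z) given Z, with the number of checks."""
    checks = 0
    for size in range(len(existing) + 1):
        for z in combinations(existing, size):
            checks += 1
            rest = [v for v in existing if v not in z]
            if is_independent(x, rest, list(z)):
                return set(z), checks
    return set(existing), checks


existing = ["a", "b", "c", "d"]

# Toy oracle: x depends on the existing nodes only through "c".
found, checks = minimal_separating_subset(
    "x", existing, lambda x, rest, given: "c" in given)

# Worst case: only the full set separates, so all 2**4 subsets are visited.
worst, worst_checks = minimal_separating_subset(
    "x", existing, lambda x, rest, given: not rest)
```

In the worst case every one of the $2^k$ subsets is tested before the search stops, which is the source of the exponential cost noted in the text.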

The other category of methods for constructing Bayesian
networks from a database defines a search metric (or
score function) which is a function of the network structure
and the database. A search strategy is then employed to
identify a network structure which has the maximum score
among all possible network structures. A typical algorithm
in this category is the Bayesian method (K2) developed by
Cooper and Herskovits [3]. In the algorithm, they defined a
score function, $p(B_S, D)$, which is proportional to the posterior
probability $p(B_S \mid D)$. Given a predefined node order, the
problem of searching for a network structure which maximizes
the score function reduces to searching for a parent set for each
node that maximizes the local score. Since a greedy method
is used when searching for the parent set of each node, the
algorithm cannot guarantee to find a network structure
which has the maximum score among all structures obeying the
given node order. In addition, it can be shown that in general
the algorithm cannot guarantee to find any I-map^1 [2].
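
The greedy parent search can be sketched as follows. The local score used here is the commonly cited Cooper-Herskovits form with uniform priors, computed in log space; that formula is not reproduced in this text, so it should be read as an assumption, as should the data layout (each row is a tuple of integer values indexed by node) and all function names.

```python
import math
from collections import defaultdict


def log_k2_local_score(data, node, parents, arity):
    """Log local score of `node` given `parents` (uniform-prior assumption)."""
    counts = defaultdict(lambda: [0] * arity[node])
    for row in data:
        counts[tuple(row[p] for p in parents)][row[node]] += 1
    r = arity[node]
    score = 0.0
    for n_ijk in counts.values():  # one entry per observed parent configuration
        score += math.lgamma(r) - math.lgamma(sum(n_ijk) + r)
        score += sum(math.lgamma(n + 1) for n in n_ijk)
    return score


def k2_parents(data, node, predecessors, arity, max_parents):
    """Greedily add the predecessor that most improves the local score."""
    parents = []
    best = log_k2_local_score(data, node, parents, arity)
    while len(parents) < max_parents:
        candidates = [p for p in predecessors if p not in parents]
        if not candidates:
            break
        scored = [(log_k2_local_score(data, node, parents + [p], arity), p)
                  for p in candidates]
        top_score, top = max(scored)
        if top_score <= best:
            break  # no single addition helps, so the greedy search stops
        parents.append(top)
        best = top_score
    return parents


# Toy database: node 2 copies node 0, node 1 is unrelated.
rows = [(0, 0, 0), (0, 1, 0), (1, 0, 1), (1, 1, 1)] * 5
chosen = k2_parents(rows, node=2, predecessors=[0, 1], arity=[2, 2, 2], max_parents=2)
```

Because the search stops as soon as no single added parent improves the score, it can miss parent sets whose members are only informative jointly, which is one way the greedy strategy can fail to reach the best structure.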

This paper is organized as follows. In Section 2, we
introduce notational conventions, definitions, and
assumptions used in the remainder of the paper. In Section
3, the new learning algorithms are presented, which can be
shown to be computationally efficient and to guarantee finding a
minimal I-map of the underlying distribution of the database
when the sample size is sufficiently large. In Section 4, we
give concluding remarks and discuss future work.


^1 A network topology that guarantees that nodes found to be
separated correspond to independent variables [6].


2.  PRELIMINARIES 


Throughout the discussion, we consider a domain $U$ of $n$
discrete random variables (r.v.), $x_1, \ldots, x_n$. Each variable has
a finite number of values. Lower-case letters refer to
r.v.'s, and upper-case letters refer to sets of r.v.'s. Let $p$ be a
joint probability distribution over $U$. Let $X$, $Y$, and $Z$ be
disjoint subsets of $U$. We use $p(X \mid Y)$ to represent the
conditional probability distribution function (cpdf) for $X$,
given all possible instantiations of $Y$. We say that $X$ is
conditionally independent of $Y$ given $Z$, denoted as $I(X,Z,Y)$,
if $p(X \mid Z,Y) = p(X \mid Z)$ (or equivalently $p(X,Y \mid Z) = p(X \mid Z)\,p(Y \mid Z)$)
for all possible value assignments of $X$, $Y$, and $Z$.

An observation or a sample over $U$ is a value assignment to
all variables in $U$. A database $D$ of observations over $U$ is a
list of observations. In this paper, we assume that the
observations in the database are independent of each other.
Furthermore, we assume that there are no observations with
missing values in the database.

Definition 1 Let $B_S$ be a directed acyclic topological
structure for the domain $U$ and let $B_P$ be a set of specified
conditional pdfs $p(x_i \mid \Pi_i)$, $i = 1, \ldots, n$, where $\Pi_i$ is the parent
set of $x_i$ defined by $B_S$. Then a Bayesian network $B$ is
defined to be a pair $(B_S, B_P)$.

A Bayesian network structure $B_S$ specifies for each node $x_i$ a
parent set $\Pi_i$, $i = 1, \ldots, n$. Thus, we can write $B_S = \{\Pi_1, \ldots, \Pi_n\}$.
A Bayesian network for a domain $U$ represents a jpdf
over $U$. It can be shown [6] that a given Bayesian network
$(B_S, B_P)$ uniquely specifies a jpdf $p$ of $U$, where

$$p(x_1,\ldots,x_n) = \prod_{i=1}^{n} p(x_i \mid \Pi_i). \qquad (2.1)$$

On the other hand, given a jpdf $p$ on a domain $U$, we can
always find a Bayesian network $B = (B_S, B_P)$ defined on $U$ such
that (2.1) holds. Thus, $B$ can be called a Bayesian network
of the jpdf $p$. It is well known that a given jpdf may have
many Bayesian network representations, that is, they all
represent the given jpdf. An important property of a
Bayesian network is its node ordering. A node ordering is a
priority constraint such that if $x_i$ precedes $x_j$ in the ordering,
then we do not allow structures in which there is an arc from
$x_j$ to $x_i$. Now, let $B = (B_S, B_P)$ be a Bayesian network of $p$
with an order $O$: $i_1, i_2, \ldots, i_n$. Then it can be shown that

$$p(x_{i_k} \mid \Pi_{i_k}) = p(x_{i_k} \mid x_{i_1}, \ldots, x_{i_{k-1}}), \quad \text{for } k = 1, \ldots, n. \qquad (2.2)$$

It follows that if $B' = (B'_S, B'_P)$ is another Bayesian network
of $p$ with the same order $O$, then by (2.2)

$$p(x_{i_k} \mid \Pi_{i_k}) = p(x_{i_k} \mid \Pi'_{i_k}), \quad \text{for } k = 1, \ldots, n. \qquad (2.3)$$

We can always identify a node order for a Bayesian network.
Conversely, given a jpdf $p$ and an arbitrary node ordering $i_1$,
$i_2, \ldots, i_n$, it can be shown that there must exist a Bayesian
network which has the given order and represents $p$. In fact,
based on the chain rule of probability,

$$p(x_1,\ldots,x_n) = \prod_{k=1}^{n} p(x_{i_k} \mid x_{i_1}, \ldots, x_{i_{k-1}}). \qquad (2.4)$$

The right hand side of (2.4) defines a Bayesian network of $p$
which is fully connected. Note that a fully connected
network carries no information about conditional
independence assertions, so it is not very useful. In order to
reveal as much conditional independence as possible, it is
necessary to construct a Bayesian network which has no
redundant arcs under the given node order. We introduce the
following definitions:

Definition 2 Let $B = (B_S, B_P)$ be a Bayesian network of $p$
with a given node order. $B$ is called minimal if, whenever any arc of
$B_S$ is removed, $p$ can no longer be represented by $B$ for any $B_P$.
A minimal Bayesian network structure is also called a
minimal I-map.

Most Bayesian network learning algorithms aim to find a
minimal I-map. If a node order is specified, the learning
procedure reduces to finding, for each node, its parent nodes
among the candidate nodes that precede it in the order.

3.  LEARNING  BAYESIAN  NETWORKS 

Learning a Bayesian network from a database $D$ of
observations comprises two tasks: learning the network
structure $B_S$ and, after a proper network structure is
identified, estimating the set of conditional probabilities $B_P$.
In this paper, the focus is on developing efficient algorithms
for learning network structures. Once $B_S$ is obtained, $B_P$ can
be estimated statistically from the database.

A Conditional Independence Test (CI) Approach

Some Bayesian network learning algorithms employ a
so-called conditional independence (CI) test. Typical
examples include the algorithm developed by Srinivas [7] and
CONSTRUCTOR by Fung [4].

Srinivas's method is designed to construct a sparse network.
The algorithm begins from an empty network and
recursively adds one node at each step to the existing
network until all $n$ nodes are included. More
specifically, let $X_k$ denote the set of $k$ nodes in the existing
network at the $k$th step. Then, for each of the $n - k$ remaining nodes
$x \in U - X_k$, the algorithm identifies a parent set $\Pi_x \subseteq X_k$ such
that $p(x \mid X_k) = p(x \mid \Pi_x)$. The node which has the smallest number
of parents is chosen to be added to the existing network. At
the first step, the algorithm needs to choose a root node
based on either expert prior knowledge or, if no prior
knowledge is available, a random selection. Identifying the
parent set of a node at the $k$th step requires a total of $2^k$ CI tests,
which is the number of possible subsets of $X_k$. Hence,
the total number of CI tests needed for identifying a network
is proportional to $2^n$.
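To get a feel for this growth, the following sketch counts the CI tests of an exhaustive search. It assumes, as one reading of the description above, that at every step $k = 1, \ldots, n-1$ each of the $n - k$ remaining nodes is tested against all $2^k$ subsets of $X_k$; the function name is hypothetical.

```python
def exhaustive_ci_tests(n):
    """CI tests when each of the n - k remaining nodes tries all 2**k subsets
    of the k nodes already in the network, for k = 1, ..., n - 1."""
    return sum((n - k) * 2 ** k for k in range(1, n))

for n in (5, 10, 20):
    print(n, exhaustive_ci_tests(n))
```

Under this assumption the count equals $2^{n+1} - 2n - 2$, which is consistent with the stated growth proportional to $2^n$.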

The CONSTRUCTOR algorithm suffers from the same
problem. In this algorithm, the idea is to find for each node
$x \in U$ a Markov blanket $B_x \subset U$. The Markov blanket shields
$x$ from the rest of $U$, namely, $p(x \mid U - \{x\}) = p(x \mid B_x)$, and
comprises $x$'s parents, children, and children's parents.
Based on the information provided by the Markov blankets
for all nodes, a Bayesian network can then be constructed.
Again, it is a nontrivial problem to identify the $B_x$'s, since it needs
$2^{n-1}$ tests for each node and a total of $n 2^{n-1}$ CI tests if an
exhaustive search is used.

Both algorithms mentioned above are computationally
intractable because of their search requirement. In the
remainder of this section, we focus on developing an
efficient search method to identify the parent set or the
Markov blanket for each node. The problem can be
formulated as follows: let $V \subseteq U - \{x\}$; the goal is to find a
smallest subset $\Pi_x \subseteq V$ such that

$$p(x \mid \Pi_x) = p(x \mid V). \tag{3.1}$$

Using the conditional independence symbol, (3.1) can be denoted
as $I(x, \Pi_x, V - \Pi_x)$, namely, $x$ and $V - \Pi_x$ are conditionally
independent given $\Pi_x$. Before deriving our efficient
algorithm, we introduce some useful properties of
conditional independence which can be found in [6].

Let $X, Y, Z, W \subseteq U$ be mutually disjoint sets of variables.
Then we have the following properties:

weak union: $I(X, Z, WY) \Rightarrow I(X, ZW, Y) \;\&\; I(X, ZY, W)$,

intersection: $I(X, ZW, Y) \;\&\; I(X, ZY, W) \Rightarrow I(X, Z, WY)$,

where the intersection property requires the probability distribution $p$
to be strictly positive. In what follows we assume $p$ is
strictly positive. Based on these properties, we then have
the following theorem.

Theorem 1 Let $x \in U$, and let $V \subseteq U - \{x\}$ be a set of random
variables. Then, for a subset $\Pi \subseteq V$, $I(x, \Pi, V - \Pi)$ holds
if and only if $I(x, V - \{y\}, y)$ holds for all $y \in V - \Pi$.

Proof: Necessity: for any $y \in V - \Pi$, let $Y = \{y\}$,
$W = V - \Pi - \{y\}$, $Z = \Pi$, and $X = \{x\}$. Then the necessity part
follows directly from the weak union property.

Sufficiency: let $V - \Pi = \{y_1, y_2, \ldots, y_k\}$. Now we have

$$I(x, V - \{y_i\}, y_i) \quad \text{for all } i = 1, 2, \ldots, k. \tag{3.2}$$

Let $X = \{x\}$, $Y = \{y_1\}$, $W = \{y_2\}$, and $Z = V - \{y_1, y_2\}$; then
we have $I(X, ZW, Y)$ and $I(X, ZY, W)$. By the intersection
property, we obtain $I(X, Z, WY)$, which is

$$I(x, V - \{y_1, y_2\}, \{y_1, y_2\}).$$

Now let $Y = \{y_1, y_2\}$, $W = \{y_3\}$, and $Z = V - \{y_1, y_2, y_3\}$. Again
using the intersection property, we obtain

$$I(x, V - \{y_1, y_2, y_3\}, \{y_1, y_2, y_3\}).$$

Repeating this procedure, eventually we obtain

$$I(x, V - \{y_1, \ldots, y_k\}, \{y_1, \ldots, y_k\}),$$

which is $I(x, \Pi, V - \Pi)$. ∎

Theorem 2 Let $x \in U$, and let $V \subseteq U - \{x\}$ be a set of random
variables, $Y = \{y \in V : I(x, V - \{y\}, y)\}$, and $\Pi_x = V - Y$. Then
$I(x, \Pi_x, V - \Pi_x)$ holds. Furthermore, $\Pi_x$ is the smallest
conditioning set in the sense that if there is another $\Pi$ such
that $I(x, \Pi, V - \Pi)$ holds, we must have $\Pi_x \subseteq \Pi$.

Proof: Note that $Y = V - \Pi_x$; then by the definition of $Y$ and
Theorem 1, it follows that $I(x, \Pi_x, V - \Pi_x)$ holds. The second
part of the theorem is proved as follows:

$I(x, \Pi_x, V - \Pi_x)$ and $I(x, \Pi, V - \Pi)$
iff (by Theorem 1)
$I(x, V - \{y\}, y)$ for all $y \in V - \Pi_x$, and
$I(x, V - \{y\}, y)$ for all $y \in V - \Pi$
iff $I(x, V - \{y\}, y)$ for all $y \in (V - \Pi_x) \cup (V - \Pi) = V - \Pi_x \cap \Pi$.

By the definition of $Y$, we must have $V - \Pi_x \cap \Pi \subseteq Y$, which
implies $\Pi_x = V - Y \subseteq \Pi_x \cap \Pi$. Thus it follows that $\Pi_x \subseteq \Pi$. ∎

Theorem 2 provides an efficient method to identify the
smallest conditioning set for a given variable. To find the
parent set $\Pi_x \subseteq V$ of a node $x$, all that is needed is to check, for
each $y \in V$, whether $I(x, V - \{y\}, y)$ holds, that is, whether $x$ and $y$ are
conditionally independent given $V - \{y\}$. If the answer is
"yes", $y$ does not belong to $\Pi_x$, and if the answer is "no", $y$
must belong to $\Pi_x$. Thus the number of CI tests required to
identify the parent set is equal to the size of $V$. Returning to
Srinivas's method, at the $k$th step we need only $(n - k)k$ CI tests,
and in total we need only $\sum_{k=1}^{n-1} k(n - k) \approx n^3 / 6$ CI tests. In
CONSTRUCTOR, we need only $n(n-1)$ CI tests.
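The search suggested by Theorem 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the CI test is an exact oracle computed from a small, explicitly tabulated, strictly positive joint distribution (a hypothetical chain a -> b -> x plus an independent variable c), whereas in practice each test would be a statistical test on the database. All names and numbers are assumptions.

```python
from itertools import product

names = ["a", "b", "c", "x"]

def build_joint():
    """Hypothetical strictly positive joint: a -> b -> x, with c independent."""
    joint = {}
    for a, b, c, x in product((0, 1), repeat=4):
        pa = 0.4 if a else 0.6
        pb = (0.8 if a else 0.3) if b else (0.2 if a else 0.7)
        pc = 0.5
        px = (0.9 if b else 0.25) if x else (0.1 if b else 0.75)
        joint[(a, b, c, x)] = pa * pb * pc * px
    return joint

def marginal(joint, keep):
    """Marginal table over the variables in `keep` (in that order)."""
    idx = [names.index(v) for v in keep]
    out = {}
    for values, p in joint.items():
        key = tuple(values[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def cond_indep(joint, x, z, y, tol=1e-12):
    """Exact check of I(x, z, y): p(x,y,z) p(z) == p(x,z) p(y,z) everywhere."""
    z = list(z)
    pxyz = marginal(joint, [x, y] + z)
    pxz = marginal(joint, [x] + z)
    pyz = marginal(joint, [y] + z)
    pz = marginal(joint, z)
    for key, p in pxyz.items():
        xv, yv, zv = key[0], key[1], key[2:]
        if abs(p * pz[zv] - pxz[(xv,) + zv] * pyz[(yv,) + zv]) > tol:
            return False
    return True

def smallest_conditioning_set(joint, x, candidates):
    """Theorem 2: keep y iff x and y are dependent given candidates - {y}.
    Uses exactly len(candidates) CI tests."""
    return [y for y in candidates
            if not cond_indep(joint, x, [v for v in candidates if v != y], y)]

parent_set = smallest_conditioning_set(build_joint(), "x", ["a", "b", "c"])
print(parent_set)
```

Each candidate is examined once, so the number of CI tests equals the size of the candidate set instead of the number of its subsets.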

The other class of methods for constructing Bayesian networks from a
database defines a search metric (or score function)
which is a function of the network structure $B_S$ and the database
$D$. The network structure $B_S$ which has the maximum score
among all possible network structures is then identified. In the
remainder of this section, we discuss two such approaches and
develop an efficient search algorithm.

A  Minimum  Description  Length  Approach  (MDL) 

The  MDL  method  is  based  on  the  minimum  description 
length  principle  [1]  which  stems  from  coding  theory.  The 
goal  of  the  method  is  to  create  a  network  structure  that 
describes  the  database  as  accurately  as  possible  with  as  few 
symbols  as  possible. 

Let $U = \{x_1, x_2, \ldots, x_n\}$, where each $x_i$ can take a value from
$\{x_{i1}, x_{i2}, \ldots, x_{ir_i}\}$, $r_i > 1$, $i = 1, \ldots, n$. Let $D_N$ be a database
with $N$ observations over $U$. Let $B_S$ denote a network
structure over $U$, and for each variable $x_i$, let $\Pi_i$ be the set of
parents of $x_i$ defined by $B_S$. Furthermore, for each $\Pi_i$, let $w_{ij}$
denote the $j$th instantiation of $\Pi_i$, $j = 1, \ldots, q_i$, $q_i > 0$. Now,
let $N_{ijk}$ be the number of observations in $D_N$ in which the
variable $x_i$ has the value $x_{ik}$ and $\Pi_i$ is instantiated as $w_{ij}$.
Finally, let $N_{ij} = \sum_{k=1}^{r_i} N_{ijk}$. Then the description length
$L(B_S, D_N)$ of the network structure $B_S$ given the database $D_N$
is defined by

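$$L(B_S, D_N) = \log P(B_S) - N \times H(B_S, D_N) - \tfrac{1}{2} K \log N, \tag{3.3}$$

where $K = \sum_{i=1}^{n} q_i (r_i - 1)$, and

$$H(B_S, D_N) = -\sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} \frac{N_{ijk}}{N} \log \frac{N_{ijk}}{N_{ij}}.$$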


The first term of (3.3) models the prior distribution on
network structures. In the second term, $H(B_S, D_N)$ represents the
conditional entropy of the network structure $B_S$, and it is a
non-negative quantity. In the third term, the factor $K$ is the
number of (independent) probabilities that have to be
estimated from the database $D_N$ to obtain the probability
tables $B_P$ for the network structure $B_S$. With an increasing
number of arcs, a network structure is able to describe the
jpdf which generates the database more accurately, so
the entropy term $-N \times H(B_S, D_N)$ increases. However, the
cost term $-\tfrac{1}{2} K \log N$ decreases when more arcs are added.
The network structure with the highest quality balances
these two terms and has the highest score in (3.3). For a
problem domain $U$ with $n$ variables, the number of possible
network structures is huge, so an exhaustive search is
computationally prohibitive. An alternative is to use a
greedy search. Unfortunately, a greedy search is not guaranteed to find
a global maximum, and for certain types of distributions
generating the database, it cannot lead to an I-map no matter
how large the database is [2].
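The score (3.3) can be computed directly from the counts $N_{ijk}$ and $N_{ij}$. The sketch below is one possible way to do this, assuming that the data are given as tuples of integer-coded values, that a structure is given as one parent tuple per variable, and that $q_i$ is taken as the number of all possible parent instantiations. The function and argument names are hypothetical.

```python
import math
from collections import Counter

def mdl_score(data, parents, arity, log_prior=0.0):
    """MDL score (3.3): log P(Bs) - N*H(Bs, D_N) - 0.5*K*log N.

    data    : list of tuples, one observed value per variable
    parents : parents[i] is the tuple of parent indices of variable i
    arity   : arity[i] is r_i, the number of values of variable i
    """
    n_obs = len(data)
    neg_n_entropy = 0.0   # accumulates -N*H = sum of N_ijk * log(N_ijk / N_ij)
    k_params = 0          # K = sum over i of q_i * (r_i - 1)
    for i, pa in enumerate(parents):
        n_ijk = Counter((tuple(row[p] for p in pa), row[i]) for row in data)
        n_ij = Counter()
        for (w, _), count in n_ijk.items():
            n_ij[w] += count
        neg_n_entropy += sum(c * math.log(c / n_ij[w])
                             for (w, _), c in n_ijk.items())
        q_i = math.prod(arity[p] for p in pa)   # all parent instantiations
        k_params += q_i * (arity[i] - 1)
    return log_prior + neg_n_entropy - 0.5 * k_params * math.log(n_obs)

# Toy database in which the second variable always copies the first.
data = [(0, 0)] * 50 + [(1, 1)] * 50
without_edge = mdl_score(data, [(), ()], [2, 2])
with_edge = mdl_score(data, [(), (0,)], [2, 2])
print(without_edge, with_edge)
```

In this toy case the arc removes all conditional entropy of the second variable, which outweighs the extra half-logarithm penalty for the additional parameter, so the structure with the arc scores higher.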


To develop an efficient algorithm, we first investigate some
properties of the MDL score. Let $U$ be defined as before,
and let $D_N = \{U_1, U_2, \ldots, U_N\}$ be $N$ independent samples of $U$.
Let $B = (B_S, B_P)$ be a Bayesian network with a parameter set
$B_P = \{\theta_{ijk}\}$, where $\theta_{ijk} = p(x_i = x_{ik} \mid \Pi_i = w_{ij})$, $i = 1, \ldots, n$,
$j = 1, \ldots, q_i$, and $k = 1, \ldots, r_i$. It can be shown that the posterior log
likelihood of $D_N$ given $B$ is:

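$$\log p(D_N \mid B_S, B_P) = \sum_{h=1}^{N} \log p(U_h \mid B_S, B_P) = \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} \log \theta_{ijk}^{N_{ijk}} = \sum_{i=1}^{n} \sum_{j=1}^{q_i} N_{ij} \sum_{k=1}^{r_i} \frac{N_{ijk}}{N_{ij}} \log \theta_{ijk}. \tag{3.4}$$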

A simple fact associated with the posterior log likelihood is
described in the following lemma.

Lemma 1 Let $D_N = \{U_1, U_2, \ldots, U_N\}$ be $N$ independent
samples of $U$. Let $B = (B_S, B_P)$ and $B' = (B'_S, B'_P)$ be two
Bayesian networks over $U$. If $B$ and $B'$ represent the same
joint distribution $p$ over $U$, then

$$\log p(D_N \mid B_S, B_P) = \log p(D_N \mid B'_S, B'_P).$$

Proof: Since $B$ and $B'$ represent the same joint distribution $p$
of $U$, we have

$$\log p(D_N \mid B_S, B_P) = \log p(D_N \mid p) = \log p(D_N \mid B'_S, B'_P).$$ ∎

Now let us examine the right hand side of (3.4). The
expression in (3.4) can be maximized with respect to the $\theta$'s by
using Shannon's inequality, which states that $\sum_i a_i \log a_i \ge \sum_i a_i \log b_i$
for all $a_i, b_i > 0$ with $\sum_i a_i = \sum_i b_i = 1$, where equality
holds iff $a_i = b_i$ for all $i$. Hence we have

$$\log p(D_N \mid B_S, B_P) \le \sum_{i=1}^{n} \sum_{j=1}^{q_i} N_{ij} \sum_{k=1}^{r_i} \frac{N_{ijk}}{N_{ij}} \log \frac{N_{ijk}}{N_{ij}} = \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} \log \left( \frac{N_{ijk}}{N_{ij}} \right)^{N_{ijk}} = \log p(D_N \mid B_S, \hat{B}_P), \tag{3.5}$$

where $\hat{B}_P = \{\hat{\theta}_{ijk}\}$ with $\hat{\theta}_{ijk} = N_{ijk} / N_{ij}$ is the maximum
likelihood estimator (MLE) of $B_P$ with $K$ degrees of freedom,
where $K$ is defined as in (3.3). Note that the right hand side
of inequality (3.5) is $-N \times H(B_S, D_N)$, so the entropy term in the
MDL score is the posterior log likelihood of $D_N$ given $B_S$ and
the conditional probabilities estimated by relative
frequencies from $D_N$.

Based on the statistical theory of hypothesis testing [8, p.381], if
the probability distribution $p$ represented by $B$ is the true
distribution that generates the sample $D_N$, the log likelihood
ratio $-2 \log \left[ p(D_N \mid B_S, B_P) / p(D_N \mid B_S, \hat{B}_P) \right]$ converges in
distribution, as the sample size $N \to \infty$, to a chi-squared distribution
with $K$ degrees of freedom. Hence we have the following lemma.

Lemma 2 Let $D_N = \{U_1, U_2, \ldots, U_N\}$ be $N$ independent
samples of $U$ with probability distribution $p$ defined over $U$.
Let $B = (B_S, B_P)$ be a Bayesian network defined over $U$.
Then if the joint probability distribution represented by $B$ is
identical to $p$, we have, as $N \to \infty$,

$$2 \left( -N \times H(B_S, D_N) - \log p(D_N \mid B_S, B_P) \right) \to \chi^2_K$$

in distribution, where $\chi^2_K$ denotes a chi-squared distribution
with $K$ degrees of freedom.

The proof is immediate because $-N \times H(B_S, D_N) = \log p(D_N \mid B_S, \hat{B}_P)$. ∎

When learning a Bayesian network, it is desirable to find a
minimal I-map of $p$. To attain this purpose, we need to
identify some asymptotic properties associated with the MDL
score. Combining Lemma 1 and Lemma 2, we have the
following theorem.


Theorem 3 Let $D_N = \{U_1, U_2, \ldots, U_N\}$ be $N$ independent
samples of $U$ with a strictly positive probability
distribution $p$ over $U$. Let $B_S^0 = \{\Pi_1^0, \ldots, \Pi_n^0\}$,
$B_S^F = \{\Pi_1^F, \ldots, \Pi_n^F\}$ and $B_S = \{\Pi_1, \ldots, \Pi_n\}$ be three network
structures obeying the same node order, where $B_S^0$ is a
minimal I-map of $p$, $B_S^F$ is a fully connected network
structure, and $B_S$ is obtained by deleting an arbitrary edge of
$B_S^F$. Then for any large positive number $M \gg 1$, we have

$$P\left( L(B_S, D_N) - L(B_S^F, D_N) > M \right) \to 1 \quad \text{as } N \to \infty,$$

if the deleted edge does not belong to $B_S^0$, that is, $\Pi_i^0 \subseteq \Pi_i$,
$i = 1, \ldots, n$.

In  order  to  prove  the  theorem,  we  need  the  following  lemma. 

Lemma 3 Let $\{X_N\}$ and $\{Y_N\}$ be random variable sequences
which converge in distribution to random variables $X$ and $Y$,
respectively. Let $\{a_N\}$ be a sequence that converges to minus
infinity; then we have

$$P(X_N - Y_N \le a_N) \to 0 \quad \text{as } N \to \infty.$$

Proof: For any $M_1 > 0$ and $M_2 > 0$, when $N$ is sufficiently
large such that $a_N < -M_2$, we have

$$\{X_N - Y_N \le a_N\} = \{X_N \le a_N + Y_N\}$$
$$= \left( \{X_N \le a_N + Y_N\} \cap \{Y_N \le M_1\} \right) \cup \left( \{X_N \le a_N + Y_N\} \cap \{Y_N > M_1\} \right)$$
$$\subseteq \{X_N \le a_N + M_1\} \cup \{Y_N > M_1\},$$

which implies

$$P(X_N - Y_N \le a_N) \le P(X_N \le M_1 - M_2) + P(Y_N > M_1) = F_{X_N}(M_1 - M_2) + 1 - F_{Y_N}(M_1),$$

where $F$ denotes the cumulative distribution function. Thus,

$$\limsup_{N \to \infty} P(X_N - Y_N \le a_N) \le F_X(M_1 - M_2) + 1 - F_Y(M_1).$$

Letting $M_2 \to \infty$ followed by $M_1 \to \infty$, we obtain

$$\lim_{N \to \infty} P(X_N - Y_N \le a_N) = 0.$$ ∎

Now  we  begin  to  prove  Theorem  3. 

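Proof (Theorem 3). It is equivalent to show that

$$P\left( L(B_S, D_N) - L(B_S^F, D_N) \le M \right) \to 0 \quad \text{as } N \to \infty.$$

The difference between the two scores is

$$R - N \times H(B_S, D_N) + N \times H(B_S^F, D_N) + \tfrac{1}{2} (K^F - K) \log N, \tag{3.6}$$

where $R = \log \left[ P(B_S) / P(B_S^F) \right]$ is a constant as $N \to \infty$, and
$K^F$ and $K$ denote the values of $K$ in (3.3) for $B_S^F$ and $B_S$,
respectively. Assume that
$B_S$ is obtained by deleting one of the parent nodes of $x_i$, that
is, $\Pi_i \subset \Pi_i^F$, and $\Pi_j = \Pi_j^F$ for all $j \ne i$. Since $p$ is strictly
positive, when $N$ is sufficiently large, all possible
instantiations of $\Pi_i^F$ must occur; thus $q_i^F > q_i$, and $q_j^F = q_j$
for all $j \ne i$. Note that the $r_j$ are the same for both structures.
We conclude that $K^F - K > 0$, and
hence the third term in (3.6) goes to infinity.

On the other hand, since both $B_S^F$ and $B_S$ are supergraphs of
$B_S^0$, we can find $B_P^F$ and $B_P$ such that $B^F = (B_S^F, B_P^F)$ and
$B = (B_S, B_P)$ both represent $p$. Then by Lemma 2, as $N \to \infty$,

$$X_N \equiv 2 \left( -N \times H(B_S, D_N) - \log p(D_N \mid B_S, B_P) \right) \to \chi^2_K,$$
$$Y_N \equiv 2 \left( -N \times H(B_S^F, D_N) - \log p(D_N \mid B_S^F, B_P^F) \right) \to \chi^2_{K^F}.$$

Note that by Lemma 1, $\log p(D_N \mid B_S, B_P) = \log p(D_N \mid B_S^F, B_P^F)$.
We conclude

$$P\left( L(B_S, D_N) - L(B_S^F, D_N) \le M \right) = P\left( X_N - Y_N \le 2M - 2R - (K^F - K) \log N \right) \to 0 \quad \text{as } N \to \infty,$$

where the limit in the last step follows from Lemma 3. ∎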

Theorem 3 tells us that when searching for a minimal I-map we can
begin from a fully connected network structure, then delete
one edge and examine the MDL score. If the deleted edge is
not an edge in the minimal I-map, the score will increase
with high probability. However, a question arises: what
happens if the deleted edge is an edge in the minimal
I-map? The following theorem answers this question.

Theorem 4 Let $D_N = \{U_1, U_2, \ldots, U_N\}$ be $N$ independent
samples of $U$ with a strictly positive probability distribution
$p$ over $U$. Let $B_S^0 = \{\Pi_1^0, \ldots, \Pi_n^0\}$, $B_S^F = \{\Pi_1^F, \ldots, \Pi_n^F\}$ and
$B_S = \{\Pi_1, \ldots, \Pi_n\}$ be three network structures obeying the
same node order, where $B_S^0$ is a minimal I-map of $p$, $B_S^F$ is
a fully connected network structure, and $B_S$ is obtained by
deleting an arbitrary edge of $B_S^F$. Then almost surely (a.s.)
we have

$$L(B_S, D_N) - L(B_S^F, D_N) \to -\infty \quad \text{as } N \to \infty,$$

if the deleted edge is an edge in $B_S^0$, that is, there exists $i$ such that
$\Pi_i^0 \not\subseteq \Pi_i \subset \Pi_i^F$, and $\Pi_j^0 \subseteq \Pi_j = \Pi_j^F$ for all $j \ne i$.

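Proof: As discussed in the proof of Theorem 3, the third
term in (3.6), $\tfrac{1}{2} (K^F - K) \log N \to +\infty$, and the first term $R$
is a constant. Now let us consider the behavior of the
entropy term $H(B_S, D_N)$. By the strong law of large numbers,
we have that

$$\frac{N_{ijk}}{N} \to p(x_i = x_{ik}, \Pi_i = w_{ij}) \quad \text{as } N \to \infty \text{ a.s.},$$

$$\frac{N_{ijk}}{N_{ij}} \to p(x_i = x_{ik} \mid \Pi_i = w_{ij}) \quad \text{as } N \to \infty \text{ a.s.}$$

Then we have, almost surely,

$$-H(B_S, D_N) + H(B_S^F, D_N) \to \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} p(x_i = x_{ik}, \Pi_i = w_{ij}) \log p(x_i = x_{ik} \mid \Pi_i = w_{ij}) - \sum_{j=1}^{q_i^F} \sum_{k=1}^{r_i} p(x_i = x_{ik}, \Pi_i^F = w_{ij}^F) \log p(x_i = x_{ik} \mid \Pi_i^F = w_{ij}^F).$$

By marginalization the above limit can be written as

$$\sum_{j=1}^{q_i^F} \sum_{k=1}^{r_i} p(x_i = x_{ik}, \Pi_i^F = w_{ij}^F) \log p(x_i = x_{ik} \mid \Pi_i = w_{ij'}) - \sum_{j=1}^{q_i^F} \sum_{k=1}^{r_i} p(x_i = x_{ik}, \Pi_i^F = w_{ij}^F) \log p(x_i = x_{ik} \mid \Pi_i^F = w_{ij}^F),$$

where $w_{ij'}$ is the instantiation of $\Pi_i$ that conforms to $w_{ij}^F$. Again by
Shannon's inequality, it can be shown that the above expression is less
than zero because $B_S$ is not an I-map, and there are indexes $j$ and $k$ such
that $p(x_i = x_{ik} \mid \Pi_i = w_{ij'}) \ne p(x_i = x_{ik} \mid \Pi_i^F = w_{ij}^F)$. Thus
we conclude that, as $N \to \infty$,

$$-N \times H(B_S, D_N) + N \times H(B_S^F, D_N) = N \left( -H(B_S, D_N) + H(B_S^F, D_N) \right) \to -\infty \quad \text{a.s.}$$

Although the third term goes to infinity, the $N$ term
dominates the $\log N$ term as $N \to \infty$, so the difference between the two
MDL scores converges to minus infinity almost surely. ∎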


Based on Theorem 3 and Theorem 4, we have the following
algorithm. Given a database $D_N$ with $N$ independent samples
drawn from a strictly positive joint pdf $p$, to construct
a minimal I-map $B_S^0$ of $p$ that obeys a node order, we begin
from a fully connected network structure $B_S^F$ that obeys the
same node order. Secondly, we examine all edges in $B_S^F$ to
determine which edges are in $B_S^0$ and which are not. To do so,
we delete one edge at a time in $B_S^F$ to generate a new
network structure $B_S$. If the score difference between $B_S$ and
$B_S^F$ is greater than zero, that is,

$$L(B_S, D_N) - L(B_S^F, D_N) > 0,$$

we believe with a certain degree of confidence that the edge
just deleted is not in $B_S^0$ according to Theorem 4; otherwise, it is
in $B_S^0$ according to Theorem 3. Note that there are only $n(n-1)/2$
structures $B_S$ to be examined, and each one differs from $B_S^F$ by
only one edge. Finally, those edges which make the
score difference less than zero are collected and comprise
the minimal I-map. This algorithm is computationally
efficient and is guaranteed to converge to the minimal I-map
that obeys the given order provided that the sample size is
sufficiently large.
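The edge-deletion procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the structure prior is taken as uniform (so $\log P(B_S)$ cancels in score differences), $q_i$ is the number of all possible parent instantiations, and the database is an idealised one whose counts are exactly proportional to a hypothetical chain distribution x0 -> x1 -> x2, so that sampling noise is absent. With a real sample the decisions hold only with high probability for large $N$. All names and numbers are hypothetical.

```python
import math
from collections import Counter
from itertools import product

def mdl_score(data, parents, arity):
    """MDL score (3.3) with a uniform structure prior (log P(Bs) dropped)."""
    total, k_params = 0.0, 0
    for i, pa in enumerate(parents):
        n_ijk = Counter((tuple(r[p] for p in pa), r[i]) for r in data)
        n_ij = Counter()
        for (w, _), c in n_ijk.items():
            n_ij[w] += c
        total += sum(c * math.log(c / n_ij[w]) for (w, _), c in n_ijk.items())
        k_params += math.prod(arity[p] for p in pa) * (arity[i] - 1)
    return total - 0.5 * k_params * math.log(len(data))

def learn_minimal_imap(data, order, arity):
    """Delete each edge of the fully connected structure once and keep the
    edges whose deletion lowers the score."""
    full = [()] * len(order)
    for pos, node in enumerate(order):
        full[node] = tuple(order[:pos])      # all predecessors are parents
    base = mdl_score(data, full, arity)
    kept = []
    for child in order:
        for parent in full[child]:
            trial = list(full)
            trial[child] = tuple(p for p in full[child] if p != parent)
            if mdl_score(data, trial, arity) - base < 0:
                kept.append((parent, child))
    return kept

# Idealised database: counts proportional to p(x0) p(x1 | x0) p(x2 | x1).
p1 = {0: 0.2, 1: 0.8}    # hypothetical p(x1 = 1 | x0)
p2 = {0: 0.1, 1: 0.7}    # hypothetical p(x2 = 1 | x1)
data = []
for x0, x1, x2 in product((0, 1), repeat=3):
    prob = 0.5 * (p1[x0] if x1 else 1 - p1[x0]) * (p2[x1] if x2 else 1 - p2[x1])
    data += [(x0, x1, x2)] * round(1000 * prob)

edges = learn_minimal_imap(data, order=[0, 1, 2], arity=[2, 2, 2])
print(edges)
```

Only $n(n-1)/2$ candidate structures are scored, one per edge of the fully connected structure, and each score difference involves a single node's counts, so an implementation could avoid recomputing the unchanged terms.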


A  Bayesian  Approach  (K2) 


As mentioned before, the philosophy of Bayesian learning
methods is, in principle, based on a so-called score function
which is proportional to the posterior probability $p(B_S \mid D_N)$ of
a network structure $B_S$ given the database $D_N$. The network
structure $B_S$ which maximizes the score function is
considered to be the most likely structure generating the
database. Cooper and Herskovits [3] used $P(B_S, D_N)$ as a
score function and, based on a number of assumptions, they
proved, for a discrete database $D_N$, that


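$$P(B_S, D_N) = c \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}! \tag{3.7}$$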


where $c$ is the prior probability $P(B_S)$ for each $B_S$, and $r_i$, $q_i$,
$N_{ijk}$, and $N_{ij}$ are defined as in the MDL score. Bouckaert [3]
investigated the difference between the MDL score and the Bayesian
score, and concluded that

$$L(B_S, D_N) = \log P(B_S, D_N) + O(1), \tag{3.8}$$

where $O(1)$ is with respect to $N$. That is, as $N \to \infty$, $O(1)$
tends to a constant. However, its behavior as it tends to
the limit depends on both $B_S$ and $D_N$. From formula (3.8),
we can show that the algorithm developed for the MDL score can
also be used for the Bayesian score. The only difference is
that when examining whether the deleted edge is in $B_S^0$, we use the
log of the Bayesian score. By slightly rewriting Theorem 3 and
Theorem 4, it can be shown that the algorithm also converges
to the minimal I-map provided the sample size in the
database is sufficiently large. In addition, the computational
efficiency is the same as for MDL.
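For use in the same edge-deletion procedure, the logarithm of the Cooper and Herskovits score can be evaluated with log-gamma functions to avoid overflowing factorials. The sketch below assumes the standard form of that score, integer-coded data, and one parent tuple per variable; the function and argument names are hypothetical.

```python
import math
from collections import Counter

def log_k2_score(data, parents, arity, log_prior=0.0):
    """Log of the Cooper-Herskovits score:
    log c + sum_ij [log (r_i-1)! - log (N_ij+r_i-1)!] + sum_ijk log N_ijk!."""
    score = log_prior
    for i, pa in enumerate(parents):
        n_ijk = Counter((tuple(row[p] for p in pa), row[i]) for row in data)
        n_ij = Counter()
        for (w, _), c in n_ijk.items():
            n_ij[w] += c
        r_i = arity[i]
        # Parent instantiations that never occur contribute log(1) = 0.
        for n in n_ij.values():
            score += math.lgamma(r_i) - math.lgamma(n + r_i)
        score += sum(math.lgamma(c + 1) for c in n_ijk.values())
    return score

# One binary variable observed once as 0 and once as 1: 1! * 1! * 1! / 3! = 1/6.
single = log_k2_score([(0,), (1,)], [()], [2])
print(single)
```

Because `math.lgamma(m + 1)` equals $\log m!$ for non-negative integers $m$, the function works for large counts where the factorials themselves would be impractical to form.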


4.  CONCLUSIONS 

In this paper, a number of properties of Bayesian
networks and score functions are identified. A set of theorems
is derived, based on which we derive two categories of
Bayesian network learning algorithms, one for conditional
independence tests, and the other for the MDL score and the Bayesian
score. Throughout the derivation, we show that the new
algorithms are simple and computationally
efficient, and are guaranteed to converge to a minimal I-map of
the probability distribution which generates the database if
the sample size is sufficiently large. The algorithms
developed in this paper need to be evaluated using
simulated and real data, particularly for databases with a
finite sample size. Learning a Bayesian network from a database
is a difficult task because there is a huge number of possible
network structures to be considered. Although it is known
that given any node order there must exist a minimal I-map
of the underlying distribution, an improperly chosen node order
may lead to a network which fails to reveal as much
conditional independence as desired. Hence, developing an
algorithm that can, without relying on a prespecified node
order, lead to a sparse I-map among all minimal I-maps is an
important future research direction.